Abstract

Recent theoretical work ([15], [9]) in graphical models has introduced classes of
flexible multi-parameter Wishart distributions for high dimensional Bayesian inference.
A parallel analysis for DAGs or Bayesian networks, arguably one of the most
widely used classes of graphical models, is however not available. For Gaussian DAG 
models the parameter of interest is the Cholesky space of lower triangular matrices 
with fixed zeros corresponding to the missing arrows of a directed acyclic graph Q. In 
this paper we construct a family of DAG Wishart distributions that form a rich conju- 
gate family of priors with multiple shape parameters for Gaussian DAG models, and 
proceed to undertake a theoretical analysis of this class with the goal of posterior infer- 
ence. We first prove that our family of DAG Wishart distributions satisfies the strong 
directed hyper Markov property. Operating on the Cholesky space we derive closed 
form expressions for normalizing constants, posterior moments, Laplace transforms 
and posterior modes, and demonstrate the use of the DAG Wishart class in posterior 
analysis. We then consider submanifolds of the cone of positive definite matrices that 
correspond to covariance and concentration matrices of Gaussian DAG models. In 
general these spaces are curved manifolds and thus the DAG Wisharts have no density
w.r.t. Lebesgue measure. Hence tools for posterior inference on these spaces are not
immediately available. We tackle the problem in three parts, with each part building
on the previous one, until a complete solution is available for all DAGs.

In Part I we note that when Q is perfect, associated covariance and concentration 
spaces are open cones and hence we proceed to derive the induced DAG Wishart dis- 
tribution on these cones. A comprehensive analysis is however only possible for the 
class of perfect DAGs. In Part II we formally establish that for any non-perfect DAG, 
covariance and concentration spaces have Lebesgue measure zero in any Euclidean 
vector space containing them, and hence the DAG Wishart family introduced above 
does not have a density w.r.t. Lebesgue measure for Q non-perfect. We therefore 
propose a unified approach for all Gaussian DAG models by appealing to the theory
of Hausdorff measure. First we derive the functional form of the DAG Wishart
density w.r.t. Hausdorff measure. We demonstrate however that even for the simplest of
graphs, the Hausdorff density is not amenable to posterior analysis. In Part III we
define new spaces that are projections of covariance and concentration DAG spaces onto
Euclidean space that yield natural isomorphisms. We exploit this bijection to derive
the densities of DAG Wishart and DAG inverse Wishart distributions w.r.t. Lebesgue
measure, and thus avoid recourse to Hausdorff densities. We demonstrate that this
third approach is extremely beneficial and is readily amenable to high dimensional
posterior analysis. We derive hyper Markov properties and posterior moments for
DAG Wishart and inverse Wishart distributions corresponding to arbitrary DAGs, and
not just for the class of perfect DAGs.

1 Introduction 

Graphical models yield compact representations of the joint probability distribution of a 
multivariate random vector, and they have therefore proved to be very useful in discov- 
ering structure, especially in high-dimensional data. These models use nodes of graphs 
to represent components of a random vector, and edges between these nodes, to capture 
the relationships between these variables. In general these graphs can have three types of 
edges: directed, undirected or bi-directed. Undirected graphs are often used to represent 
association through conditional independences whereas bi-directed graphs are often used 
to represent marginal independences. Directed acyclic graphical models (DAGs),
sometimes referred to as Bayesian networks¹, are often used to represent causal relationships
among random variables. Graphical Markov models corresponding to DAGs have useful 
statistical properties, especially in high dimensional settings. The joint probability den- 
sity function (pdf) of a DAG model factorizes according to the graph into the product of 
the conditional pdfs for each variable given its parents, and can thus lead to a substantial 
reduction in the dimensionality of the parameter space. Directed acyclic graphical models 
have also found widespread use in the biomedical sciences, social sciences and in computer 
science. Estimating the covariance or inverse covariance corresponding to such DAGs is 
therefore an important area of research, especially in high dimensional settings. 

From a theoretical statistics perspective DAG models correspond to curved exponential 
families, and are distinctly different from the standard undirected or concentration graph 
models, which correspond to natural exponential families. Unlike in the natural
exponential family setup, a default prior, as given by the Diaconis-Ylvisaker (DY) framework, is
not available for a general DAG (see [9] for a thorough discussion). A connection between
DAGs and undirected or concentration graphs can however be used to derive a default prior
for a subclass called perfect graphs. Indeed if the graph is "perfect"², such DAGs are said to
be Markov equivalent to decomposable concentration graph models, i.e., they both capture 
the same set of conditional independences. This connection can be exploited in the sense 
that inferential tools for perfect DAGs can be borrowed from the decomposable concentra- 
tion graph setting. More specifically, in their pioneering work, Dawid and Lauritzen 
developed the DY prior for this class of models when the graph is decomposable. In partic- 
ular, they introduced the hyper-inverse Wishart as the DY conjugate prior for concentration 
graph models. This work has been extended by the recent methodological contributions of
Letac and Massam [15], who develop a rich family of multi-parameter conjugate priors that
subsumes the DY class. Both the hyper inverse Wishart priors and the "Letac-Massam"
priors have attractive properties which enable Bayesian inference, with the latter allowing
multiple shape parameters and hence being suitable in high-dimensional settings. Bayesian
procedures corresponding to these Letac-Massam priors have been derived in a decision
theoretic framework in the recent work of Rajaratnam et al. [19]. A parallel theory for the
class of Gaussian covariance graph models, graphical models which encode marginal
independencies, has been recently developed by Khare and Rajaratnam ([9], [10]). With a few
exceptions (see [21] for instance), all of the above methodological contributions are for
decomposable graph models. Furthermore, the results for undirected or concentration graph
models cannot be carried over to DAGs when Q is no longer perfect. This is because the
Markov equivalence property between DAGs and undirected graphical models breaks down,
in the sense that the DAG model and the undirected graphical model capture different sets
of conditional independencies when Q is not perfect (or equivalently non-decomposable).

1 Sometimes also called recursive Markov models.
2 A concept that will be formally defined later.

The literature on graphical models in general, and DAGs in particular, is extensive and 
thus we do not undertake a literature review of the work in this area. We shall however 
briefly review work that is directly relevant to this paper when notation and terminology are
introduced.

The principal objective of this paper is to develop a framework for flexible high di- 
mensional Bayesian inference for Gaussian DAG models, i.e., for the class of Gaussian 
distributions that have the directed Markov property. The previously established classes of 
generalized multi-parameter Wishart distributions developed by Letac and Massam [15] in
the concentration graph setting, and by Khare and Rajaratnam ([9]) in the covariance graph
model setting, are not directly applicable to the general DAG setting, though they provide 
useful insights, as will be demonstrated in this paper. 

For Gaussian DAG models the parameter of interest, denoted by $\Theta_Q$, is the space of
lower triangular matrices with fixed zeros corresponding to the missing arrows of a directed
acyclic graph Q. We introduce a rich class of generalized multi-parameter DAG Wishart
distributions on $\Theta_Q$ and study it with the explicit goal of Bayesian inference
in high dimensional settings. This family extends the classical Wishart distribution in the
sense that the latter is a special case of our family of DAG Wishart distributions. A
comprehensive analysis of this family of generalized Wishart distributions is possible
for arbitrary DAGs when working on $\Theta_Q$. Indeed analytic expressions for posterior
moments, Laplace transforms, posterior modes and hyper Markov properties are established.
The hyper Markov property in turn enables the explicit computation of expected values 
and Laplace transforms in the Cholesky parameterization. Unlike their concentration and 
covariance graph counterparts, we show that sampling from our Wishart distribution for 
an arbitrary DAG model does not require recourse to MCMC. Once more we note that for 
concentration graph models, sampling from the posterior can be done in closed form only 
for decomposable models. For covariance graph models, sampling from the posterior
without recourse to MCMC can be done only for homogeneous graphs. We also show that our
DAG Wishart distributions can be derived in an equivalent way using the general approach 
in under the so-called global independence assumption. The latter approach does not 
however immediately give a means to specify hyper-parameters that will correspond to our 
DAG Wishart distributions. We also provide a discussion of the fact that our DAG Wishart 
distributions are in general different from the Letac-Massam priors. However, when the
underlying graph Q is homogeneous, the Letac-Massam $W_{P_Q}$ priors are a special case of
our distributions. We also provide a discussion of the fact that our DAG Wishart
distributions are in general different from the priors introduced in Khare and Rajaratnam ([9]) for
covariance graph models.

After introducing our class of flexible hyper Markov laws we explicitly tackle the
question of deriving tools for Bayesian inference for Gaussian DAG models. In order to do this
we consider submanifolds of the cone of positive definite matrices that correspond to
covariance and concentration matrices of Gaussian DAG models. In general these spaces
are curved manifolds and thus the DAG Wisharts have no density w.r.t. Lebesgue measure.
Hence tools for posterior inference on these spaces are not immediately available. We
proceed to tackle this problem in three parts, with each part building on the previous one, until
a complete solution is available for all DAGs.

In Part I we derive DAG Wishart densities for perfect DAGs. It was noted above that
the spaces of covariance and concentration matrices corresponding to Gaussian DAGs are
in general curved sub-manifolds of Euclidean space. When Q is perfect however, these are
open cones and the induced DAG Wishart density on these cones can be derived. We then
proceed to derive Laplace transforms and expected values in this setting. Computation of
expected values of covariance and concentration matrices corresponding to DAG models is
no longer possible with this approach when Q is not perfect, as the spaces on which these
matrices live are in general curved manifolds. We note that a comprehensive framework
for Bayesian inference that goes beyond "perfect" graphs is however critical for practical
applications. The induced Wishart and inverse Wishart distributions on concentration and
covariance spaces for general DAGs require more sophisticated tools and are the subject of
Parts II and III.

In Part II we undertake the endeavor of deriving the induced Wishart and inverse
Wishart densities on covariance and concentration spaces corresponding to arbitrary DAGs.
We first establish that for any non-perfect DAG, covariance and concentration spaces have
Lebesgue measure zero in any Euclidean vector space containing them, and hence the DAG
Wishart family introduced in our previous work does not have a density w.r.t. Lebesgue
measure. We propose to overcome this in two novel ways. First we derive the functional
form of the DAG Wishart density w.r.t. Hausdorff measure by developing the appropriate tools
which allow us to work on concentration spaces corresponding to DAGs. This approach
entails working with curved manifolds and Hausdorff measures on arbitrary metric spaces.
We then proceed to demonstrate that even for the simplest of graphs, the Hausdorff density
is not amenable to posterior analysis.

In Part III we define new spaces that are projections of covariance and concentration
DAG spaces onto Euclidean space that yield natural isomorphisms. In particular, these new
spaces, termed "incomplete" covariance and concentration spaces, correspond
to functionally independent elements of the covariance and concentration matrices of
Gaussian DAG models. Given incomplete matrices from these spaces, it is always possible
to "complete" them in polynomial time, so that the completion corresponds to covariance
and concentration matrices of Gaussian DAG models. We exploit these bijections to derive
the densities of DAG Wishart and DAG inverse Wishart distributions w.r.t. Lebesgue
measure and thus avoid recourse to Hausdorff densities. We demonstrate that the latter
approach is novel, extremely beneficial, and readily amenable to high dimensional
posterior analysis. We then proceed to establish hyper Markov properties and derive posterior
moments for DAG Wishart and inverse Wishart distributions corresponding to arbitrary
DAG models and not just for the class of perfect DAGs. In doing so we succeed in
developing a unified framework for all Gaussian DAG models that is suitable for both perfect
and non-perfect DAGs. Our approach also allows us to formally demonstrate that the class
of inverse DAG Wisharts introduced in this paper naturally contains an important sub-class
of inverse Wishart distributions that was introduced by Khare and Rajaratnam [9] in the
context of Gaussian covariance graph models.

Table 1 summarizes the properties of the various multi-parameter Wishart distributions
that have been recently introduced to the mathematical statistics literature for use in
Gaussian graphical models. It is clear from this table that the Wishart distributions introduced in
this paper are applicable in all generality, and not just when the graph is perfect, or
equivalently, decomposable, and are in this sense very powerful. The ability to specify the induced
Wishart distributions and posterior moments for arbitrary graphs is especially useful.

Table 1: Properties of Wishart distributions for the three classes of Gaussian graphical
models.

Abbreviations. ND: Non-decomposable, D/P: Decomposable/Perfect, H: Homogeneous. 

This paper is structured as follows. Section 2 introduces the required preliminaries and
notation, and Section 3 formally defines Gaussian DAG models and the parameterizations
corresponding to them. The introduction, preliminaries and parameterizations
for DAG models are discussed in some detail to make the paper self-contained and to
establish consistent notation. These sections can be skipped by a reader familiar with the
subject matter. In Section 5 the class of generalized Wishart distributions for Gaussian
DAG models is formally constructed. Conjugacy to the class of Gaussian DAG models
and necessary and sufficient conditions for integrability are established. Furthermore, a
comparison to conjugate priors in concentration graph and covariance graph models is
undertaken. Section 6 establishes hyper Markov properties for our family of priors. In Section
7 we evaluate Laplace transforms, posterior moments and posterior modes for our class of
distributions corresponding to the Cholesky parameterization when Q is an arbitrary DAG.

The analysis of our DAG Wishart distributions on the corresponding covariance and
concentration spaces, with a view to developing tools for high dimensional Bayesian inference,
is carried out in three parts. Part I (Section 8) considers the class of perfect
DAGs and derives the induced DAG Wishart
densities on covariance and concentration spaces for perfect DAGs. We then proceed to
show that the expected values of the covariance and concentration matrix can be computed
easily for perfect DAGs.

Part II (Section 9) derives the density of our priors w.r.t. Hausdorff measure
when Q is arbitrary, i.e., no longer perfect. Part III (Sections 10 and 11) defines
functionally independent projections of the spaces of concentration and covariance matrices
that correspond to arbitrary DAG models and proceeds to derive the induced measure of
our class of Wishart distributions on these spaces. Consequently we proceed to establish
hyper Markov properties and derive the expected value of the covariance and concentration
matrix for DAG models. We also demonstrate that when Q is no longer perfect the class
of DAG Wishart distributions does not belong to the class of general exponential families.
Section 12 concludes by summarizing the results in the paper.

2 Preliminaries 

In this section, we give the necessary notation, background and preliminaries required in 
subsequent sections. 

2.1 Graph theoretic notation and terminology 

In this subsection, we introduce some necessary graph theoretic notation and terminology. 
Our notation presented here closely follows the notation established in [13], [6].

A graph Q is a pair of objects $(V, E)$, where V is a finite set representing the vertices (or
nodes) of Q, and E is a subset of $V \times V$ consisting of the edges. An edge $(i, j) \in E$ is called
directed if $(j, i) \notin E$. We write this as $i \to j$ and say that i is a parent of j, and that j is
a child of i. The set of parents of a vertex j is denoted by pa(j), and the set of children of
a vertex i is denoted by ch(i). The family of j, denoted by fa(j), is $\mathrm{fa}(j) = \mathrm{pa}(j) \cup \{j\}$.
Two distinct vertices i and j are said to be adjacent if $(i, j)$ or $(j, i)$ is in E, i.e., if there
is any type of edge, directed or undirected, between these two vertices. We write $i \sim j$ if
there is an undirected edge³ between vertices i and j and say that i is a neighbor of j, j is a
neighbor of i, or i and j are neighbors. The set of neighbors of i is denoted by ne(i).

More generally, for $A \subseteq V$ we define pa(A), ch(A), ne(A) and bd(A) as the collection
of the parents, children, neighbors, and boundary respectively, of the members of A, but
excluding any vertex in A:

$$\mathrm{pa}(A) = \bigcup_{i \in A} \mathrm{pa}(i) \setminus A, \quad \mathrm{ch}(A) = \bigcup_{i \in A} \mathrm{ch}(i) \setminus A, \quad \mathrm{ne}(A) = \bigcup_{i \in A} \mathrm{ne}(i) \setminus A, \quad \mathrm{bd}(A) = \mathrm{pa}(A) \cup \mathrm{ne}(A).$$

3 Note that in enumerating the number of edges of a graph, each undirected edge, though consisting of two
pairs, counts only once.

An undirected graph, "UG", is a graph with all of its edges undirected, whereas a di- 
rected graph, "DG", is a graph with all of its edges directed. We shall use the symbol Q to 
denote a general graph, and make clear within the context in which it is used, whether Q is 
undirected or directed. 

We say that the graph $Q' = (V', E')$ is a subgraph of $Q = (V, E)$, denoted by $Q' \subseteq Q$,
if $V' \subseteq V$ and $E' \subseteq E$. In addition, if $Q' \subseteq Q$ and $E' = (V' \times V') \cap E$, we say that $Q'$ is an
induced subgraph of Q. We shall consider only induced subgraphs in what follows. For a
subset $A \subseteq V$, the induced subgraph $Q_A = (A, (A \times A) \cap E)$ is said to be the graph induced
by A. A graph Q is called complete if every pair of vertices is adjacent. A clique of Q
is an induced complete subgraph of Q that is not a subset of any other induced complete
subgraph of Q. More simply, a subset $A \subseteq V$ is called a clique if the induced subgraph $Q_A$
is a clique of Q.

A path of length $k \geq 1$ from vertex i to j is a finite sequence of distinct vertices
$v_0 = i, \ldots, v_k = j$ in V and edges $(v_0, v_1), \ldots, (v_{k-1}, v_k) \in E$. We say that the path is directed if
at least one of the edges is directed. We say i leads to j, denoted by $i \mapsto j$, if there is a
directed path from i to j. A graph $Q = (V, E)$ is called connected if for any pair of distinct
vertices $i, j \in V$ there exists a path between them. An n-cycle in Q is a path of length n with
the additional requirement that the end points are identical. A directed n-cycle is defined
accordingly. A graph is acyclic if it does not have any cycles. An acyclic directed graph,
denoted by DAG (or ADG), is a directed graph with no cycles of length greater than 1.

The undirected version of a graph $Q = (V, E)$, denoted by $Q^u = (V, E^u)$, is the undirected
graph obtained by replacing all the directed edges of Q by undirected ones. An immorality
in a directed graph Q is an induced subgraph of the form $i \to k \leftarrow j$. Moralizing an
immorality entails adding an undirected edge between the pair of parents that share a
common child. Then the moral graph of Q, denoted by $Q^m = (V, E^m)$, is the undirected
graph obtained by first moralizing each immorality of Q and then making the undirected
version of the resulting graph. Naturally there are DAGs which have no immoralities and
this leads to the following definition.

Definition 2.1. A DAG Q is said to be "perfect" if it has no immoralities; i.e., the parents 
of all vertices are adjacent, or equivalently if the set of parents of each vertex induces a 
complete subgraph of Q . 
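
As a minimal sketch of Definition 2.1, the following Python snippet checks whether every pair of parents of each vertex is adjacent. The dictionary encoding of the DAG (each vertex mapped to its parent set) and the name `is_perfect` are illustrative assumptions and do not appear in the paper.

```python
from itertools import combinations


def is_perfect(parents):
    """Return True if the DAG has no immoralities.

    `parents` maps each vertex j to the set pa(j) (hypothetical encoding).
    """
    def adjacent(u, v):
        # u and v are adjacent if there is an arrow in either direction.
        return u in parents[v] or v in parents[u]

    return all(
        adjacent(u, v)
        for pa in parents.values()
        for u, v in combinations(sorted(pa), 2)
    )


# Immorality 1 -> 3 <- 2 with 1 and 2 non-adjacent: not perfect.
collider = {1: set(), 2: set(), 3: {1, 2}}
# Adding the arrow 1 -> 2 makes the parents of 3 adjacent: perfect.
married = {1: set(), 2: {1}, 3: {1, 2}}
```

Here `is_perfect(collider)` is false and `is_perfect(married)` is true, matching the two cases distinguished by the definition.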

Given a directed acyclic graph (DAG), the set of ancestors of a vertex j, denoted by
an(j), is the set of those vertices i such that $i \mapsto j$. Similarly, the set of descendants of
a vertex i, denoted by de(i), is the set of those vertices j such that $i \mapsto j$. The set of
non-descendants of i is $\mathrm{nd}(i) = V \setminus (\mathrm{de}(i) \cup \{i\})$. A set $A \subseteq V$ is called ancestral when A
contains the parents of its members. The smallest ancestral set containing the subset B of
V is denoted by An(B).
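
The sets an(j), de(i), nd(i) and An(B) can be computed by a simple graph search. The sketch below assumes the DAG is stored as a dictionary mapping each vertex to its parent set; this encoding and all function names are illustrative choices, not the paper's.

```python
def ancestors(parents, j):
    """an(j): all vertices i with a directed path from i to j."""
    found, stack = set(), list(parents[j])
    while stack:
        v = stack.pop()
        if v not in found:
            found.add(v)
            stack.extend(parents[v])
    return found


def descendants(parents, i):
    """de(i): all vertices j such that i is an ancestor of j."""
    return {j for j in parents if i in ancestors(parents, j)}


def non_descendants(parents, i):
    """nd(i): all vertices other than i and its descendants."""
    return set(parents) - descendants(parents, i) - {i}


def smallest_ancestral_set(parents, B):
    """An(B): B together with all ancestors of its members."""
    result = set(B)
    for b in B:
        result |= ancestors(parents, b)
    return result


# Hypothetical DAG with arrows 4 -> 3 -> 1 and 4 -> 2 -> 1.
dag = {1: {2, 3}, 2: {4}, 3: {4}, 4: set()}
```

For this example an(1) = {2, 3, 4}, de(4) = {1, 2, 3}, nd(2) = {3, 4} and An({2}) = {2, 4}.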
2.2 Decomposable graphs

An undirected graph Q is said to be decomposable if it has no induced subgraph that is a cycle
of length greater than or equal to four. The reader is referred to Lauritzen [13] for all the
common notions of decomposable graphs that we will use here. One such important notion
is that of a perfect order of the cliques. Every decomposable graph admits a perfect order
of its cliques. Let $(C_1, \ldots, C_k)$ be one such perfect order of the cliques of the graph Q. The
history for the graph is given by $H_1 = C_1$ and

$$H_j = C_1 \cup C_2 \cup \cdots \cup C_j, \quad j = 2, 3, \ldots, k,$$

and the (minimal vertex) separators of the graph are given by

$$S_j = H_{j-1} \cap C_j, \quad j = 2, 3, \ldots, k.$$

Let

$$R_j = C_j \setminus H_{j-1} \quad \text{for } j = 2, 3, \ldots, k.$$

Let $k' \leq k - 1$ denote the number of distinct separators and $\nu(S)$ denote the multiplicity of
S, i.e., the number of j such that $S_j = S$. Generally, we will denote by $\mathscr{C}_Q$ the set of cliques
of a graph and by $\mathscr{S}_Q$ its set of separators.
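
The histories, separators and residuals follow mechanically from a perfect order of the cliques. The sketch below is illustrative only: it assumes the cliques are supplied already in a perfect order (it does not check this), and the function name and the 1-based dictionary keys are hypothetical conveniences.

```python
from collections import Counter


def histories_separators_residuals(cliques):
    """Compute H_j, S_j and R_j from cliques C_1, ..., C_k given in a perfect order.

    Returns three dictionaries keyed by j (1-based); S and R start at j = 2.
    """
    H, S, R = {1: set(cliques[0])}, {}, {}
    for j in range(2, len(cliques) + 1):
        C_j = set(cliques[j - 1])
        S[j] = H[j - 1] & C_j   # S_j = H_{j-1} intersected with C_j
        R[j] = C_j - H[j - 1]   # R_j = C_j minus H_{j-1}
        H[j] = H[j - 1] | C_j   # H_j = C_1 united with ... united with C_j
    return H, S, R


# Path graph 1 - 2 - 3 - 4 with cliques {1,2}, {2,3}, {3,4}.
H, S, R = histories_separators_residuals([{1, 2}, {2, 3}, {3, 4}])

# Star graph with centre 1: the separator {1} occurs twice.
_, S_star, _ = histories_separators_residuals([{1, 2}, {1, 3}, {1, 4}])
multiplicity = Counter(frozenset(s) for s in S_star.values())
```

For the path graph the separators are {2} and {3}, each with multiplicity one; for the star graph there is a single distinct separator {1} with multiplicity two.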

2.3 Markov properties for directed acyclic graphs 

Let V be a finite set of indices and $(X_i)_{i \in V}$ a collection of random variables, where each $X_i$
is a random variable on the probability space $\mathcal{X}_i$. Let the probability space $\mathcal{X}$ be defined
as the product space $\mathcal{X} = \times_{i \in V} \mathcal{X}_i$. Now let $Q = (V, E)$ be a DAG. For simplicity, and
without loss of generality, we always assume that the given DAG Q is connected and the
edge set E contains all the loops $(i, i)$, $i \in V$⁴. We say that a probability distribution P on $\mathcal{X}$
has the recursive factorization property w.r.t. Q, denoted by DF (the directed factorization
property), if there are $\sigma$-finite measures $\mu_i$ on $\mathcal{X}_i$ and nonnegative functions $k^i(x_i, x_{\mathrm{pa}(i)})$,
referred to as kernels, defined on $\mathcal{X}_{\mathrm{fa}(i)}$ such that

$$\int k^i(y_i, x_{\mathrm{pa}(i)}) \, d\mu_i(y_i) = 1, \quad \forall i \in V,$$

and P has a density p, w.r.t. the product measure $\mu = \otimes_{i \in V} \mu_i$, given by

$$p(x) = \prod_{i \in V} k^i(x_i, x_{\mathrm{pa}(i)}).$$

In this case, each kernel $k^i(x_i, x_{\mathrm{pa}(i)})$ is in fact a version of $p(x_i \mid x_{\mathrm{pa}(i)})$, the conditional
distribution of $X_i$ given $X_{\mathrm{pa}(i)}$. An immediate consequence of this definition is the following
lemma.

4 For convenience we draw the graphs without their loops.
Lemma 2.1. (from [13]) If P admits a recursive factorization w.r.t. the directed graph Q,
then it also admits a factorization w.r.t. the undirected graph $Q^m$, and, consequently, obeys
the global Markov property⁵ w.r.t. $Q^m$.

Proof. Note that for each vertex $i \in V$ the set fa(i) is a complete subset of $Q^m$. Thus if
we define $\psi_{\mathrm{fa}(i)}(x_{\mathrm{fa}(i)}) = k^i(x_i, x_{\mathrm{pa}(i)})$, then
$p(x) = \prod_{i \in V} p(x_i \mid x_{\mathrm{pa}(i)}) = \prod_{i \in V} k^i(x_i, x_{\mathrm{pa}(i)}) = \prod_{i \in V} \psi_{\mathrm{fa}(i)}(x_{\mathrm{fa}(i)})$.
Therefore, P admits a factorization w.r.t. $Q^m$ and by Proposition 3.8 in
[13] it also obeys the global Markov property w.r.t. $Q^m$. □

Another direct implication of the DF property is that if P admits a recursive factorization
w.r.t. Q, then, for each ancestral set A, the marginal distribution $P_A$ admits a recursive
factorization w.r.t. the induced graph $Q_A$. Combining this result with Lemma 2.1 we
obtain the following: if P admits a recursive factorization w.r.t. Q, then $A \perp\!\!\!\perp B \mid S \ [P]$ whenever
A and B are separated by S in $(Q_{\mathrm{An}(A \cup B \cup S)})^m$. We call this property the directed global
Markov property, DG, and any distribution that satisfies this property is said to be a
directed Markov field over Q. For DAGs the directed Markov property plays the same role
as the global Markov property does for undirected graphs, in the sense that it provides an
optimal rule for recovering the conditional independence relations encoded by the directed
graph.
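
The separation criterion behind the directed global Markov property can be sketched in three steps: restrict to the smallest ancestral set containing A, B and S, moralize the induced subgraph, and test whether S blocks every undirected path from A to B. The code below is an illustrative sketch under the assumption that the DAG is given as a dictionary of parent sets; the function names are hypothetical.

```python
from itertools import combinations


def ancestral_closure(parents, nodes):
    """Smallest ancestral set containing `nodes`."""
    closed, stack = set(), list(nodes)
    while stack:
        v = stack.pop()
        if v not in closed:
            closed.add(v)
            stack.extend(parents[v])
    return closed


def moral_graph(parents, nodes):
    """Undirected adjacency of the moralized subgraph induced by an ancestral set."""
    adj = {v: set() for v in nodes}
    for child in nodes:
        pa = parents[child] & nodes
        for p in pa:                       # drop the direction of each arrow
            adj[p].add(child)
            adj[child].add(p)
        for p, q in combinations(pa, 2):   # join parents with a common child
            adj[p].add(q)
            adj[q].add(p)
    return adj


def separated(parents, A, B, S):
    """True if S separates A from B in the moral graph of An(A, B, S)."""
    A, B, S = set(A), set(B), set(S)
    adj = moral_graph(parents, ancestral_closure(parents, A | B | S))
    seen, stack = set(), [a for a in A if a not in S]
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        if v in B:
            return False
        stack.extend(w for w in adj[v] if w not in S and w not in seen)
    return True


collider = {1: set(), 2: set(), 3: {1, 2}}   # 1 -> 3 <- 2
chain = {1: set(), 2: {1}, 3: {2}}           # 1 -> 2 -> 3
```

In the collider, 1 and 2 are separated by the empty set but not by {3}, since conditioning on 3 brings 3 into the ancestral set and moralization joins its parents. In the chain, 1 and 3 are separated by {2} but not by the empty set.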

We now introduce below another Markov property for DAGs. A distribution P on X is 
said to obey the directed local Markov property (DL) w.r.t. Q if for each $i \in V$

$$i \perp\!\!\!\perp \mathrm{nd}(i) \mid \mathrm{pa}(i).$$

Now for a given DAG Q consider the so-called "parent graph" defined as follows: The
parent graph $Q^{par}$ of Q is a DAG isomorphic to Q and obtained by relabeling the vertex
set V as $1, 2, \ldots, |V|$, in such a way that $\mathrm{pa}(i) \subseteq \{i + 1, \ldots, |V|\}$ for each vertex $i \in V$. It
is easily shown that for any given DAG it is possible to relabel the vertices so that parents
always have a higher numbering than their respective children, though such an ordering is
not unique in general. For a given parent ordering we say that P obeys the parent ordered
Markov property (PO) w.r.t. Q if for every vertex i we have

$$i \perp\!\!\!\perp \{i+1, \ldots, |V|\} \setminus \mathrm{pa}(i) \mid \mathrm{pa}(i).$$

It can be shown that if P has a density w.r.t. $\mu$, then P obeys one of the directed
Markov properties DF, DG, DL, PO if and only if it obeys all of them, i.e., the four Markov
properties for DAGs are equivalent under mild conditions [13].
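
A parent ordering can be obtained by a topological sort in which the largest unused label is always given to a vertex whose parents have all been labelled. The following is a minimal sketch, assuming the DAG is stored as a dictionary of parent sets with sortable vertex names; `parent_ordering` and `relabel` are hypothetical names, and ties are broken by sorting only to make the result deterministic.

```python
def parent_ordering(parents):
    """Return labels 1..|V| such that every parent has a larger label than its children."""
    label = {}
    next_label = len(parents)
    remaining = set(parents)
    while remaining:
        # Vertices whose parents are all labelled may receive the next (largest) label.
        ready = sorted(v for v in remaining if all(p in label for p in parents[v]))
        if not ready:
            raise ValueError("the graph contains a directed cycle")
        v = ready[0]
        label[v] = next_label
        next_label -= 1
        remaining.remove(v)
    return label


def relabel(parents, label):
    """Parent sets of the relabelled graph."""
    return {label[v]: {label[p] for p in pa} for v, pa in parents.items()}


# Hypothetical DAG with arrows a -> b, a -> c and b -> c.
dag = {"a": set(), "b": {"a"}, "c": {"a", "b"}}
label = parent_ordering(dag)
parent_graph = relabel(dag, label)
```

For this example the source `a` receives label 3, `b` label 2 and the sink `c` label 1, so that pa(1) = {2, 3}, pa(2) = {3} and pa(3) is empty, as the parent ordering requires.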

3 Gaussian directed acyclic graphical models 

In this section we focus on multivariate Gaussian distributions which obey the directed 
Markov property w.r.t. a DAG Q. From now on and unless otherwise stated, we shall 
always assume without loss of generality that Q = (V,E) is given in a parent ordering. 

5 See [13] for definition.

A Gaussian Bayesian network over Q (or Gaussian DAG over Q), denoted by $\mathcal{N}(Q)$, is
the statistical model that consists of all multivariate Gaussian distributions $N_m(\mu, \Sigma)$ which
follow the directed Markov property w.r.t. Q, where $\mu \in \mathbb{R}^m$ and $\Sigma \in PD_m(\mathbb{R})$, the set of
$m \times m$ real positive definite matrices.



3.1 Linear recursive properties of Gaussian DAGs 



Let $x = (x_1, \ldots, x_m)'$ be a random vector in $\mathbb{R}^m$ with the multivariate distribution $N_m(0, \Sigma)$.
Consider the system of linear recursive regression equations:

$$
\begin{aligned}
x_1 + \beta_{12} x_2 + \beta_{13} x_3 + \cdots + \beta_{1m} x_m &= \epsilon_1 \\
x_2 + \beta_{23} x_3 + \cdots + \beta_{2m} x_m &= \epsilon_2 \\
&\ \vdots \\
x_m &= \epsilon_m
\end{aligned}
$$

or equivalently

$$
\begin{aligned}
x_1 &= -\beta_{12} x_2 - \beta_{13} x_3 - \cdots - \beta_{1m} x_m + \epsilon_1 \\
x_2 &= -\beta_{23} x_3 - \cdots - \beta_{2m} x_m + \epsilon_2 \\
&\ \vdots \\
x_m &= \epsilon_m
\end{aligned}
$$

where $-\beta_{ij}$ is the partial regression coefficient of $x_j$ ($j > i$) in the regression of $x_i$ on
its predecessors $x_{i+1}, \ldots, x_j, \ldots, x_m$. Now $\beta_{ij}$ is zero for every $j \in \{i+1, \ldots, m\} \setminus \mathrm{pa}(i)$ if and only if
$i \perp\!\!\!\perp \{i + 1, \ldots, m\} \setminus \mathrm{pa}(i) \mid \mathrm{pa}(i)$. Hence the partial regression coefficient $\beta_{ij}$ is zero if there does not exist an
arrow from j to i, i.e., $j \notin \mathrm{pa}(i)$, $j > i$. In addition, the residuals $\epsilon_i$ are normally distributed
and mutually independent with mean zero and variance $\sigma^2_{i \mid \mathrm{pa}(i)}$. We can rewrite the first
system of equations in the form of a linear system $Bx = \epsilon$, where B is the upper triangular
matrix

$$
B = \begin{pmatrix}
1 & \beta_{12} & \cdots & \beta_{1m} \\
0 & 1 & \cdots & \beta_{2m} \\
\vdots & & \ddots & \vdots \\
0 & \cdots & 0 & 1
\end{pmatrix},
\quad
x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{pmatrix}
\quad \text{and} \quad
\epsilon = \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_m \end{pmatrix}.
$$

From this we obtain:

$$
\begin{aligned}
\mathrm{Var}[Bx] = \mathrm{Var}[\epsilon]
&\implies B \Sigma B' = \mathrm{diag}\left(\sigma^2_{1 \mid \mathrm{pa}(1)}, \ldots, \sigma^2_{m-1 \mid \mathrm{pa}(m-1)}, \sigma_{mm}\right) =: D \\
&\implies \Sigma = B^{-1} D (B')^{-1} \\
&\implies \Sigma^{-1} = B' D^{-1} B.
\end{aligned}
\qquad (3.1)
$$

Thus, if we define $L = B'$, then $\Sigma^{-1} = L D^{-1} L'$ is the so-called modified Cholesky
decomposition of $\Sigma^{-1}$, in terms of the lower triangular matrix L and the diagonal matrix $D^{-1}$. Now
consider a DAG denoted by $Q = (V, E)$. In [25] it has been shown that $N_m(0, \Sigma)$ obeys the
directed Markov property w.r.t. Q if and only if $L_{ij} = 0$ whenever there is no arrow from i
to j, i.e., $i \notin \mathrm{pa}(j)$. Equation (3.1) above therefore gives a very convenient description of
the Gaussian Bayesian network $\mathcal{N}(Q)$. We explore this model in more detail below.
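
The chain of identities in Equation (3.1) can be checked numerically on a small example. The sketch below uses exact rational arithmetic from the Python standard library; the three-vertex chain 3 → 2 → 1, the particular coefficients and variances, and all helper names are illustrative assumptions rather than anything taken from the paper.

```python
from fractions import Fraction as F


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def diag(d):
    return [[d[i] if i == j else F(0) for j in range(len(d))] for i in range(len(d))]


def inv_unit_upper(B):
    """Invert a unit upper triangular matrix by back substitution."""
    m = len(B)
    X = [[F(0)] * m for _ in range(m)]
    for c in range(m):
        for i in range(m - 1, -1, -1):
            rhs = F(1) if i == c else F(0)
            X[i][c] = rhs - sum(B[i][j] * X[j][c] for j in range(i + 1, m))
    return X


# Chain 3 -> 2 -> 1 in a parent ordering: pa(1) = {2}, pa(2) = {3}, pa(3) = {}.
# beta_13 = 0 because there is no arrow from 3 to 1.
B = [[F(1), F(-1, 2), F(0)],
     [F(0), F(1), F(-1, 3)],
     [F(0), F(0), F(1)]]
d = [F(2), F(3), F(1)]  # residual variances on the diagonal of D

B_inv = inv_unit_upper(B)
Sigma = matmul(matmul(B_inv, diag(d)), transpose(B_inv))          # B^{-1} D (B')^{-1}
L = transpose(B)
Omega = matmul(matmul(L, diag([1 / v for v in d])), transpose(L))  # L D^{-1} L'
identity = diag([F(1)] * 3)
```

With these choices `matmul(Sigma, Omega)` is the identity, so $L D^{-1} L'$ is indeed $\Sigma^{-1}$, and the zero in position (3, 1) of L carries over to a zero in the corresponding entry of $\Sigma^{-1}$, reflecting that vertices 1 and 3 are conditionally independent given 2.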
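4 Parameterizations of $\mathcal{N}(Q)$

In this section we discuss two parameterizations of the Gaussian Bayesian network
$\mathcal{N}(Q)$ that are of use in subsequent analysis. Let $a, b \subseteq V$⁶, and let $\mathbb{R}^{a \times b}$ denote the real
linear space of functions

$$T = ((i, j) \mapsto T_{ij}) : a \times b \to \mathbb{R}.$$

Each element of $\mathbb{R}^{a \times b}$ is called an $|a| \times |b|$ matrix. In particular, we define the space of
symmetric matrices as $S_a(\mathbb{R}) = \{T \in \mathbb{R}^{a \times a} : T_{ij} = T_{ji} \ \forall i, j \in a\}$, and the set of positive definite
matrices as

$$PD_a(\mathbb{R}) = \{T \in S_a(\mathbb{R}) : \xi' T \xi > 0 \ \forall \xi \in \mathbb{R}^a \setminus \{0\}\}.$$

Now let the sets a, b be a partition of the set V; then the positive definite matrix $\Sigma \in PD_m(\mathbb{R})$
can be partitioned into block matrices as follows:

$$\Sigma = \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix},$$

where $\Sigma_{aa} = (\Sigma_{ij})_{i,j \in a} \in PD_a(\mathbb{R})$, $\Sigma_{bb} = (\Sigma_{ij})_{i,j \in b} \in PD_b(\mathbb{R})$, $\Sigma_{ab} = (\Sigma_{ij})_{i \in a, j \in b} \in \mathbb{R}^{a \times b}$
and $\Sigma_{ba} = \Sigma_{ab}'$. The Schur complement of the sub-matrix $\Sigma_{aa}$ is defined as
$\Sigma_{bb \mid a} = \Sigma_{bb} - \Sigma_{ba} (\Sigma_{aa})^{-1} \Sigma_{ab}$.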

Remark 4.1. Throughout this paper, we usually suppress the notation for a principal
sub-matrix $\Sigma_{aa}$ and refer to it as $\Sigma_a$. We shall also use the convention $\Sigma_a^{-1}$ for $(\Sigma_{aa})^{-1}$ and $\Sigma^a$ for
$(\Sigma^{-1})_{aa}$.

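We now recall basic results from standard multivariate statistical theory. If $x \sim N_m(\mu, \Sigma)$, $\mu \in \mathbb{R}^V$, then $x_a \sim N_a(\mu_a, \Sigma_a)$ and for any $\mathbf{x}_b \in \mathbb{R}^b$, the conditional distribution of $x_a \mid x_b = \mathbf{x}_b$ is given by $N_a(\mu_a + \Sigma_{ab}\Sigma_b^{-1}(\mathbf{x}_b - \mu_b), \Sigma_{aa|b})$, i.e., $\Sigma_{ab}\Sigma_b^{-1}$ is the regression coefficient of $x_b$ in the regression of $x_a$ on $x_b$, and $\Sigma_{aa|b}$ is the conditional variance of the residual. More generally, for a partition $a, b, c$ of $V$ denote the corresponding block partition matrices of $\Sigma$ and $\Sigma^{-1}$ as follows:

$$\Sigma = \begin{pmatrix} \Sigma_a & \Sigma_{ab} & \Sigma_{ac} \\ \Sigma_{ba} & \Sigma_b & \Sigma_{bc} \\ \Sigma_{ca} & \Sigma_{cb} & \Sigma_c \end{pmatrix} \quad \text{and} \quad \Sigma^{-1} = \begin{pmatrix} \Sigma^a & \Sigma^{ab} & \Sigma^{ac} \\ \Sigma^{ba} & \Sigma^b & \Sigma^{bc} \\ \Sigma^{ca} & \Sigma^{cb} & \Sigma^c \end{pmatrix}.$$

Then the partial covariance matrix of $x_a$ and $x_c$ given $x_b = \mathbf{x}_b$ is denoted as

$$\begin{pmatrix} \Sigma_{aa|b} & \Sigma_{ac|b} \\ \Sigma_{ca|b} & \Sigma_{cc|b} \end{pmatrix}, \quad \text{where } \Sigma_{ac|b} = \Sigma_{ac} - \Sigma_{ab}\Sigma_b^{-1}\Sigma_{bc} \text{ and } \Sigma_{ca|b} = \Sigma_{ac|b}'.$$

In particular,

$$a \perp\!\!\!\perp c \mid b \ \text{ implies that } \ \Sigma_{ac|b} = 0 \implies \Sigma_{ac} = \Sigma_{ab}\Sigma_b^{-1}\Sigma_{bc}. \qquad (4.1)$$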
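As a small numerical illustration of (4.1), the sketch below checks the identity for a three-variable chain in which each block $a$, $b$, $c$ is a single coordinate. The function name `partial_cov_scalar` and the example covariance matrix are illustrative assumptions, not part of the paper.

```python
def partial_cov_scalar(sigma, a, b, c):
    """Sigma_{ac|b} = Sigma_ac - Sigma_ab * Sigma_b^{-1} * Sigma_bc for single indices."""
    return sigma[a][c] - sigma[a][b] * sigma[b][c] / sigma[b][b]


# Covariance of the chain x2 = e2, x1 = 2*x2 + e1, x0 = 0.5*x1 + e0,
# where e0, e1, e2 are independent with unit variance (assumed example).
sigma = [[2.25, 2.5, 1.0],
         [2.5, 5.0, 2.0],
         [1.0, 2.0, 1.0]]

# x0 and x2 are conditionally independent given x1, so Sigma_{02|1} vanishes.
residual = partial_cov_scalar(sigma, 0, 1, 2)
```

Conditioning on the wrong variable does not give zero: `partial_cov_scalar(sigma, 0, 2, 1)` is non-zero because $x_0$ and $x_1$ remain dependent given $x_2$.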



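Next, let $\mathrm{PD}_{\mathcal{G}}$ denote the set of positive definite matrices $\Sigma$ in $\mathrm{PD}_m(\mathbb{R})$ such that $N_m(\mu, \Sigma) \in \mathcal{N}(\mathcal{G})$ for every $\mu \in \mathbb{R}^m$. Clearly, $N_m(\mu, \Sigma) \in \mathcal{N}(\mathcal{G})$ if and only if $N_m(0, \Sigma) \in \mathcal{N}(\mathcal{G})$. Therefore, without loss of generality, we shall only consider centered Gaussian distributions $\{N_m(0, \Sigma) : \Sigma \in \mathrm{PD}_{\mathcal{G}}\} \subset \mathcal{N}(\mathcal{G})$. For convenience however, we shall still denote the submodel $\{N_m(0, \Sigma) : \Sigma \in \mathrm{PD}_{\mathcal{G}}\}$ by $\mathcal{N}(\mathcal{G})$.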



6 Note that lower-case letters are used to denote subsets of $V$.
4.1 D-parameterization

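A parameterization ("D-parameterization") of $\mathcal{N}(\mathcal{G})$ can be obtained by using the recursive factorization property of the Gaussian densities in $\mathcal{N}(\mathcal{G})$. First we recall the following notation from [1].

Notation. For each $i \in V$ let

$$\prec i \succ = pa(i), \quad [i \succ = \{i\} \times pa(i), \quad \prec i] = pa(i) \times \{i\},$$

$$\nprec i \nsucc = \{j : j > i\} \setminus pa(i), \quad [i \nsucc = \{i\} \times \nprec i \nsucc, \quad \prec i \nsucc = \prec i \succ \times \nprec i \nsucc,$$

$$\langle i \rangle = fa(i).$$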

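By applying the directed factorization property (DF) of $N_m(0, \Sigma) \in \mathcal{N}(\mathcal{G})$ we have

$$dN_m(0, \Sigma)(x) = \prod_{i \in V} dN(\mu_{i|pa(i)}, \Sigma_{i|pa(i)})(x_i \mid x_{pa(i)}) = \prod_{i \in V} dN(\Sigma_{[i \succ}\Sigma_{\prec i \succ}^{-1}x_{\prec i \succ}, \Sigma_{ii|\prec i \succ})(x_i), \qquad (4.2)$$

for each $x = (x_j)_{j \in V} \in \mathbb{R}^m$. Note that $N(\Sigma_{[i \succ}\Sigma_{\prec i \succ}^{-1}\mathbf{x}_{\prec i \succ}, \Sigma_{ii|\prec i \succ})$ is the conditional distribution of $x_i \mid x_{\prec i \succ} = \mathbf{x}_{\prec i \succ}$. Furthermore, using the exact functional form for the densities of the Gaussian distributions in Equation (4.2), we obtain the following expression

$$\mathrm{tr}(\Sigma^{-1}xx') = \sum_{i \in V} \mathrm{tr}\left(\Sigma_{ii|\prec i \succ}^{-1}(x_i - \Sigma_{[i \succ}\Sigma_{\prec i \succ}^{-1}x_{\prec i \succ})(x_i - \Sigma_{[i \succ}\Sigma_{\prec i \succ}^{-1}x_{\prec i \succ})'\right). \qquad (4.3)$$

It is shown in [1] that $\Sigma \in \mathrm{PD}_{\mathcal{G}}$ if and only if $\Sigma \in \mathrm{PD}_m(\mathbb{R})$ and satisfies Equation (4.3) for all $x \in \mathbb{R}^m$. On the other hand, by the parent ordered Markov property (PO) of $N_m(0, \Sigma)$ we have $\Sigma \in \mathrm{PD}_{\mathcal{G}}$ if $i \perp\!\!\!\perp \nprec i \nsucc \mid \prec i \succ$ (or equivalently $i \perp\!\!\!\perp \{i+1, \ldots, m\} \setminus pa(i) \mid pa(i)$). Another characterization given by [1] for $\Sigma \in \mathrm{PD}_{\mathcal{G}}$ is that $\Sigma \in \mathrm{PD}_m(\mathbb{R})$ and

$$\Sigma_{[i \nsucc} = \Sigma_{[i \succ}\Sigma_{\prec i \succ}^{-1}\Sigma_{\prec i \nsucc}, \quad \forall i \in V. \qquad (4.4)$$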

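Using the insights above and defining $\Xi_{\mathcal{G}} = \times_{i \in V}(\mathbb{R}_+ \times \mathbb{R}^{\prec i]})$, it can be shown that the mapping

$$(\Sigma \mapsto \times_{i \in V}(\Sigma_{ii|\prec i \succ}, \Sigma_{\prec i \succ}^{-1}\Sigma_{\prec i]})) : \mathrm{PD}_{\mathcal{G}} \to \Xi_{\mathcal{G}}$$

is a bijection. In order to construct the inverse of this mapping let $\times_{i \in V}(\lambda_i, \beta_{\prec i]})$ be a typical element in $\Xi_{\mathcal{G}}$, with the convention that $\beta_{\prec i]} = 0$ whenever $\prec i \succ = pa(i) = \emptyset$. The corresponding $\Sigma$ can be recursively constructed starting from the largest index $m$, by setting

i) $\Sigma_{ii} = \lambda_i + \beta_{\prec i]}'\Sigma_{\prec i]}$,

ii) $\Sigma_{\prec i]} = \Sigma_{\prec i \succ}\beta_{\prec i]}$, and (4.5)

iii) $\Sigma_{[i \nsucc} = \Sigma_{[i \succ}\Sigma_{\prec i \succ}^{-1}\Sigma_{\prec i \nsucc}$, according to Equation (4.4).

The reader is referred to [1] for greater detail, where in addition, it is shown that the above inverse mapping yields a positive definite matrix in $\mathrm{PD}_m(\mathbb{R})$, and consequently in $\mathrm{PD}_{\mathcal{G}}$. We shall revisit the "D-parameterization" mapping in subsequent sections.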
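The recursion (4.5) can be sketched in a few lines of code. The version below is a minimal illustration under stated assumptions: vertices are 0-based and parent-ordered, `parents`, `lam` and `beta` are hypothetical containers for $pa(i)$, $\lambda_i$ and $\beta_{\prec i]}$, and steps ii) and iii) are merged by using $\Sigma_{[i \succ}\Sigma_{\prec i \succ}^{-1} = \beta_{\prec i]}'$, so that $\Sigma_{ij} = \sum_{k \in pa(i)} \beta_k \Sigma_{kj}$ for every $j > i$.

```python
def sigma_from_d_params(parents, lam, beta):
    """Rebuild Sigma from D-parameters, following the recursion (4.5).

    parents[i] : list of parent indices of vertex i (0-based, all larger than i)
    lam[i]     : conditional variance lambda_i
    beta[i]    : regression coefficients, aligned with parents[i]
    """
    m = len(parents)
    sigma = [[0.0] * m for _ in range(m)]
    for i in range(m - 1, -1, -1):  # start from the largest index
        pa = parents[i]
        # Steps ii) and iii): Sigma_ij = sum_k beta_k * Sigma_kj over k in pa(i), for j > i.
        for j in range(i + 1, m):
            s = sum(b * sigma[k][j] for k, b in zip(pa, beta[i]))
            sigma[i][j] = sigma[j][i] = s
        # Step i): Sigma_ii = lambda_i + beta' Sigma_{pa(i), i}.
        sigma[i][i] = lam[i] + sum(b * sigma[k][i] for k, b in zip(pa, beta[i]))
    return sigma


# Chain 2 -> 1 -> 0 with unit conditional variances (assumed example values).
parents = [[1], [2], []]
lam = [1.0, 1.0, 1.0]
beta = [[0.5], [2.0], []]
sigma = sigma_from_d_params(parents, lam, beta)
```

For this chain the result agrees with the covariance of $x_2 = e_2$, $x_1 = 2x_2 + e_1$, $x_0 = 0.5x_1 + e_0$ computed by hand.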
4.2 Cholesky parameterization



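An alternative parameterization ("Cholesky-parameterization") of $\mathcal{N}(\mathcal{G})$ can be obtained by using the Cholesky decomposition of $\Sigma^{-1}$. From §3.1 it is clear that $\mathcal{N}(\mathcal{G})$ is isomorphic to the following family of distributions:

$$\{N_m(0, (L')^{-1}DL^{-1}) : (D, L) \in \Theta_{\mathcal{G}}\},$$

where $\Theta_{\mathcal{G}} = \mathcal{D}_+ \times \mathcal{L}_{\mathcal{G}}$, with $\mathcal{L}_{\mathcal{G}}$ being the set of lower triangular matrices $L$ in $\mathbb{R}^{m \times m}$ such that

$$L_{ij} = \begin{cases} 1 & i = j, \\ L_{ij} & i > j \text{ and } i \in pa(j), \\ 0 & \text{otherwise,} \end{cases}$$

and $\mathcal{D}_+$ the set of diagonal matrices in $\mathbb{R}^{m \times m}$ such that $D_{ii} > 0$. Note that $\Sigma^{-1} = LD^{-1}L'$ and that if $i \notin pa(j)$ then $L_{ij} = 0$.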
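To make the Cholesky parameterization concrete, the following sketch forms $\Sigma^{-1} = LD^{-1}L'$ for a three-vertex chain. The helper name `concentration_from_cholesky` and the numerical values are assumptions chosen for illustration only.

```python
def concentration_from_cholesky(d, l):
    """Return Omega = L D^{-1} L' for diagonal entries d and lower-triangular l."""
    m = len(d)
    return [[sum(l[i][k] * l[j][k] / d[k] for k in range(m)) for j in range(m)]
            for i in range(m)]


# Chain 2 -> 1 -> 0 (0-based): only L[1][0] and L[2][1] are free entries.
d = [1.0, 1.0, 1.0]
l = [[1.0, 0.0, 0.0],
     [-0.5, 1.0, 0.0],
     [0.0, -2.0, 1.0]]
omega = concentration_from_cholesky(d, l)
```

In this example the chain is a perfect DAG, so the zero $L_{20} = 0$ carries over to $\Omega_{20} = 0$; for a non-perfect DAG the product $LD^{-1}L'$ can be non-zero in positions where $L$ is zero.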



Remark 4.2. Note that the D-parameterization and the Cholesky parameterization of a Gaussian DAG model $\mathcal{N}(\mathcal{G})$ are essentially the same, as they both encode the partial regression coefficients and residuals in the system of regression equations described in Subsection 3.1. The difference between the two parameterizations is subtle. The $\beta_{\prec i]}$ elements of the D-parameterization constitute the non-zero off-diagonal elements of the columns of the $L$ matrix in the Cholesky parameterization, whereas the $\lambda_i$ in the D-parameterization is the $D_{ii}$ in the Cholesky parameterization. The latter parameterization is essentially a matrix version of the former. The zero and non-zero elements of the $L$ matrix in the Cholesky parameterization encode the graph corresponding to the DAG, whereas in the D-parameterization the information regarding the parents of each vertex is assumed to implicitly accompany the parameters. We shall see later that each parameterization has its respective advantage. While computations involving the D-parameterization are often easier, the Cholesky decomposition intuitively encodes the structure of the underlying DAG and provides a natural description of the Wishart DAG densities introduced in the paper.



4.3 Covariance manifolds 

In this subsection we introduce parameter spaces that correspond to general Gaussian con- 
ditional independence models. We note that the spaces defined here are for more general 
conditional independence models and not just for DAG models. The reason for this is threefold. The first stems from a desire to give a unified treatment of the parameter spaces for DAGs and undirected graphical models (UGs). Second, a more general definition will allow us to compare the distributions introduced in this paper for DAGs to those that have been previously proposed in the literature for UGs. Third, for special classes of graphs (such as decomposable or perfect graphs), DAGs and UGs coincide in terms of the conditional
independences that such graphs can encode. In these instances, the ability to exploit the 
equivalence between DAGs and UGs requires definitions of parameter spaces that are more 
general. 
We shall use the term general Markov model (or GM model) to refer to the statistical models which are defined in terms of a set of conditional independence constraints. Two GM models are said to be "Markov equivalent" if every distribution that satisfies the required conditional independence assumptions in one model will satisfy those in the other and vice versa. An important example of Markov equivalence is given by the class of perfect DAGs and the class of decomposable undirected graphs, i.e., for a perfect DAG $\mathcal{G}$ the class of GM models over $\mathcal{G}$ coincides with that of GM models over the decomposable graph $\mathcal{G}^u$, the undirected version of $\mathcal{G}$ (see [13] for details).

One important subclass of GM models over an arbitrary graph $\mathcal{G} = (V, E)$, directed or undirected, is the general Gaussian graphical model over $\mathcal{G}$, denoted by $\mathcal{N}(\mathcal{G})$ and represented by the class of multivariate normal distributions $N_m(0, \Sigma)$ obeying the corresponding Markov properties w.r.t. $\mathcal{G}$. More formally, let $\mathcal{G} = (V, E)$ be a graph (directed or undirected) and $\mathcal{N}(\mathcal{G})$ the general Gaussian graphical model over $\mathcal{G}$. Consider the following definitions.

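Definition 4.1. (a) $\mathrm{PD}_{\mathcal{G}}$ is the set of positive definite matrices $\Sigma$ in $\mathrm{PD}_m(\mathbb{R})$ such that $N_m(0, \Sigma) \in \mathcal{N}(\mathcal{G})$.

(b) $P_{\mathcal{G}}$ is the set of positive definite matrices $\Omega$ such that $\Omega^{-1} \in \mathrm{PD}_{\mathcal{G}}$.

(c) $Z_{\mathcal{G}}$ is the real linear space of $m \times m$ symmetric matrices $T$ such that $T_{ij} = T_{ji} = 0$ if $(i, j)$ is not in $E$.

(d) $I_{\mathcal{G}}$ is the real linear space of symmetric functions $\Gamma = ((i, j) \mapsto \Gamma_{ij}) : E^u \to \mathbb{R}$, i.e., $\Gamma_{ij} = \Gamma_{ji}$. An element $\Gamma \in I_{\mathcal{G}}$ is called a $\mathcal{G}$-incomplete (symmetric) matrix.

(e) $Q_{\mathcal{G}}$ is the set of $\mathcal{G}$-incomplete matrices $\Gamma \in I_{\mathcal{G}}$ such that $\Gamma_c$ is positive definite for each clique $c \in \mathcal{C}_{\mathcal{G}}$. Elements of $Q_{\mathcal{G}}$ are said to be partially positive definite matrices over $\mathcal{G}$.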

Remark 4.3. Note that $I_{\mathcal{G}}$ is naturally a real linear subspace of $S_m(\mathbb{R})$. If $\mathcal{G} = (V, E)$ is a DAG, then $I_{\mathcal{G}}$ is an $|E|$-dimensional linear subspace of $S_m(\mathbb{R})$.

Remark 4.4. We emphasize that these definitions are more general than those introduced in [15] and used in [19], as $\mathcal{G}$ can now be either directed or undirected. However, when the graph $\mathcal{G}$ is undirected the two definitions do coincide, and thus the definitions above are in some sense extended versions of those in [15]. In particular, for an undirected decomposable $\mathcal{G}$ we have $P_{\mathcal{G}} = \{\Omega \in \mathrm{PD}_m(\mathbb{R}) : \omega_{ij} = 0 \text{ if } (i, j) \notin E\}$. Moreover, when $\mathcal{G}$ is a perfect DAG then $P_{\mathcal{G}}$ ($Q_{\mathcal{G}}$) and $P_{\mathcal{G}^u}$ ($Q_{\mathcal{G}^u}$) are identical due to the Markov equivalence property of perfect DAGs and decomposable undirected graphs.

We also introduce additional notation that is required in subsequent sections: 

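Notation. Let $\mathcal{G} = (V, E)$ be a DAG. For a symmetric matrix $T \in S_m(\mathbb{R})$, we denote by $T^E$ the projection of $T$ on $I_{\mathcal{G}}$. The projection mapping $(T \mapsto T^E) : S_m(\mathbb{R}) \to I_{\mathcal{G}}$ is denoted by $\mathrm{proj}_{\mathcal{G}}$.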
5 Generalized Wishart laws for DAG Models



In this section we introduce new classes of matrix variate distributions for parameters of interest in the Gaussian DAG model.



5.1 The DAG Wishart distribution on $\Theta_{\mathcal{G}}$

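The modified Cholesky decomposition provides a natural description of Gaussian DAG models and hence we start by developing distributions on the space $\Theta_{\mathcal{G}}$ (as defined in §4.2). It is assumed henceforth, unless otherwise stated, that $\mathcal{G} = (V, E)$ is a DAG and that the vertices in $V = \{1, \ldots, m\}$ are parent-ordered$^{7}$, i.e., $pa(i) \subseteq \{i+1, \ldots, m\}$ for each $i = 1, \ldots, m-1$. Recall that the Gaussian DAG model associated with $\mathcal{G}$ is the family of distributions

$$\mathcal{N}(\mathcal{G}) = \{N_m(0, \Sigma) : \Sigma \in \mathrm{PD}_{\mathcal{G}}\} = \{N_m(0, (L^{-1})'DL^{-1}) : (D, L) \in \Theta_{\mathcal{G}}\}.$$

Consider the family of measures on $\Theta_{\mathcal{G}}$ with density (w.r.t. $\prod_{i > j, (i,j) \in E} dL_{ij} \prod_{i=1}^m dD_{ii}$)

$$\pi_{U,\alpha}(D, L) = \exp\{-\tfrac{1}{2}\mathrm{tr}((LD^{-1}L')U)\} \prod_{i=1}^m D_{ii}^{-\alpha_i/2}, \quad (D, L) \in \Theta_{\mathcal{G}}. \qquad (5.1)$$

This family of measures is parameterized by a positive definite matrix $U$ and a vector $\alpha \in \mathbb{R}^m$ with non-negative entries. Let

$$z_{\mathcal{G}}(U, \alpha) := \int_{\Theta_{\mathcal{G}}} \pi_{U,\alpha}(D, L)\, dL\, dD = \int_{\Theta_{\mathcal{G}}} \exp\{-\tfrac{1}{2}\mathrm{tr}(LD^{-1}L'U)\} \prod_{i=1}^m D_{ii}^{-\alpha_i/2}\, dL\, dD.$$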

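If $z_{\mathcal{G}}(U, \alpha) < \infty$, then $\pi_{U,\alpha}$ can be normalized to obtain a probability measure. A necessary and sufficient condition for the existence of a normalizing constant for an arbitrary DAG is obtained in the following theorem.

Theorem 5.1. Let $dL := \prod_{(i,j) \in E, i > j} dL_{ij}$ and $dD := \prod_{i=1}^m dD_{ii}$ denote, respectively, the canonical Lebesgue measures on $\mathcal{L}_{\mathcal{G}}$ and $\mathbb{R}_+^m$ and let $pa_i := |pa(i)|$. Then,

$$\int_{\Theta_{\mathcal{G}}} \exp\{-\tfrac{1}{2}\mathrm{tr}(LD^{-1}L'U)\} \prod_{i=1}^m D_{ii}^{-\alpha_i/2}\, dL\, dD < \infty$$

if and only if

$$\alpha_i > pa_i + 2 \quad \forall i = 1, \ldots, m.$$

Furthermore, in this case

$$z_{\mathcal{G}}(U, \alpha) = \prod_{i=1}^m \frac{\Gamma\left(\frac{\alpha_i}{2} - \frac{pa_i}{2} - 1\right) 2^{\frac{\alpha_i}{2} - 1} (\sqrt{\pi})^{pa_i} \det(U_{\prec i \succ})^{\frac{\alpha_i}{2} - \frac{pa_i}{2} - \frac{3}{2}}}{\det(U_{\langle i \rangle})^{\frac{\alpha_i}{2} - \frac{pa_i}{2} - 1}}. \qquad (5.2)$$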

7 We emphasize here that unlike in the decomposable concentration and covariance graph setting (where 
the existence of an ordering is important either for the perfect order of cliques and separators, or to preserve 
zeros), existence of such an ordering is not necessary in the DAG setting, since a parent ordering is always 
available for a DAG. 
Proof. The proof of this theorem is given in the Appendix/Supplemental section. □
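The closed form (5.2) is straightforward to evaluate on the log scale. The sketch below is one possible implementation in plain Python, assuming 0-based parent-ordered vertices; the function names `det`, `sub` and `log_normalizing_constant` are hypothetical helpers introduced here, and the determinant of an empty matrix is taken to be 1.

```python
import math


def det(a):
    """Determinant by Gaussian elimination with partial pivoting (empty matrix -> 1)."""
    a = [row[:] for row in a]
    n, result = len(a), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if a[p][c] == 0.0:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            result = -result
        result *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return result


def sub(u, idx):
    """Principal submatrix of u indexed by idx."""
    return [[u[i][j] for j in idx] for i in idx]


def log_normalizing_constant(u, alpha, parents):
    """log z_G(U, alpha) from (5.2); parents[i] lists pa(i), 0-based."""
    total = 0.0
    for i, pa in enumerate(parents):
        p = len(pa)
        nu = alpha[i] / 2 - p / 2 - 1
        if nu <= 0:
            raise ValueError("need alpha_i > pa_i + 2 for every i")
        total += (math.lgamma(nu)
                  + (alpha[i] / 2 - 1) * math.log(2)
                  + p / 2 * math.log(math.pi)
                  + (nu - 0.5) * math.log(det(sub(u, pa)))
                  - nu * math.log(det(sub(u, [i] + list(pa)))))
    return total
```

For a single vertex without parents, $U = 2$ and $\alpha = 4$, the formula reduces to $\int_0^\infty D^{-2}e^{-1/D}\,dD = 1$, which gives a quick sanity check of the implementation.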



Note that the expression above for the normalizing constant bears close resemblance to Corollary 3 of [9] due to the Markov equivalence of DAGs and covariance graph models when $\mathcal{G}$ is a homogeneous graph. A thorough investigation of this parallel requires more tools and is undertaken in the sequel to this paper; see [3]. At first glance the distribution defined in Equation (5.1) appears to be the same as the covariance Wishart distributions defined in [9], but a closer look reveals that they are different. Note that the product $LD^{-1}L'$ features in Equation (5.1) whereas the expression $(LDL')^{-1}$ features in the density of the covariance Wishart distributions defined in [9]. More importantly however, the closed form result above is valid for ALL DAGs and not restricted to any specific subclass of graphs such as perfect (or decomposable) graphs as in the treatment of undirected graphs [15, 19], or the class of homogeneous graphs as in the treatment of covariance graphs [9]. As will be seen later this property has significant consequences for Bayesian model selection in high dimensional settings.

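Definition 5.1. The normalized version of $\pi_{U,\alpha}$, denoted by $\pi^{\Theta_{\mathcal{G}}}_{U,\alpha}$, will be referred to as the "DAG Wishart" distribution on $\Theta_{\mathcal{G}}$ with scale parameter $U \in \mathrm{PD}_m(\mathbb{R})$ and multivariate shape parameter $\alpha = (\alpha_1, \ldots, \alpha_m)' \in \mathbb{R}^m$. The density of $\pi^{\Theta_{\mathcal{G}}}_{U,\alpha}$, when $\alpha_i > pa_i + 2$ for every $i \in V$, is given as follows:

$$\pi^{\Theta_{\mathcal{G}}}_{U,\alpha}(D, L) = z_{\mathcal{G}}(U, \alpha)^{-1} \exp\{-\tfrac{1}{2}\mathrm{tr}((LD^{-1}L')U)\} \prod_{i=1}^m D_{ii}^{-\alpha_i/2},$$

where the normalizing constant $z_{\mathcal{G}}(U, \alpha)$ is given in Equation (5.2).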

We now proceed to demonstrate that the class of matrix-variate DAG Wishart probability distributions defined above has important statistical uses. The following lemma shows that the family $\pi^{\Theta_{\mathcal{G}}}_{U,\alpha}$ is standard conjugate for Gaussian DAG models.

Lemma 5.1. Let $\mathcal{G}$ be an arbitrary DAG and let $Y_1, Y_2, \ldots, Y_n$ be an i.i.d. sample from $N_m(0, (L^{-1})'DL^{-1})$, where $(D, L) \in \Theta_{\mathcal{G}}$. Let $S = \frac{1}{n}\sum_{i=1}^n Y_iY_i'$ denote the empirical covariance matrix. If the prior distribution on $(D, L)$ is $\pi^{\Theta_{\mathcal{G}}}_{U,\alpha}$, then the posterior distribution of $(D, L)$ is given by $\pi^{\Theta_{\mathcal{G}}}_{\tilde{U},\tilde{\alpha}}$, where $\tilde{U} = nS + U$ and $\tilde{\alpha} = (n + \alpha_1, n + \alpha_2, \ldots, n + \alpha_m)$.

Proof. The proof is given in the Appendix/Supplemental section. □
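The conjugate update in Lemma 5.1 amounts to two additions. A minimal sketch, assuming centred observations stored as a list of length-$m$ lists and using the hypothetical function name `posterior_hyperparameters`:

```python
def posterior_hyperparameters(samples, u, alpha):
    """Return (U_tilde, alpha_tilde) = (nS + U, n + alpha) for centred observations."""
    n, m = len(samples), len(u)
    # nS is the sum of outer products y y' over the sample.
    n_s = [[sum(y[i] * y[j] for y in samples) for j in range(m)] for i in range(m)]
    u_tilde = [[n_s[i][j] + u[i][j] for j in range(m)] for i in range(m)]
    alpha_tilde = [n + a for a in alpha]
    return u_tilde, alpha_tilde
```

Because every $\alpha_i$ grows by $n$, the integrability condition $\alpha_i > pa_i + 2$ of Theorem 5.1 continues to hold for the posterior whenever it holds for the prior.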

Remark 5.1. The case when the observations do not have mean zero (i.e., when $Y_1, Y_2, \ldots, Y_n$ are i.i.d. $N_m(\mu, \Sigma)$, with $\mu \in \mathbb{R}^m$, $\Sigma \in \mathrm{PD}_{\mathcal{G}}$) can be handled in a similar manner by noting that the sample covariance matrix $S$ is a sufficient statistic for $\Sigma$ and the fact that $nS \sim W_m(n - 1, \Sigma)$.



6 Hyper Markov properties



This section explores the distributional properties of the family $\pi^{\Theta_{\mathcal{G}}}_{U,\alpha}$ in detail. In particular, we demonstrate that there is deep and useful structure in the DAG Wishart distributions introduced in Section 5, with important implications for statistical inference.



6.1 Strong directed hyper Markov properties of $\pi^{\Theta_{\mathcal{G}}}_{U,\alpha}$

The conceptual foundations of hyper Markov properties were laid in [7] and the reader is referred to [15] for a brief overview. Consider the Gaussian DAG model $\mathcal{N}(\mathcal{G})$ with the parameter space $\Theta_{\mathcal{G}}$. The elements of $\mathcal{N}(\mathcal{G})$ are of the form $N_m(0, (L^{-1})'DL^{-1})$, such that $(D, L) \in \Theta_{\mathcal{G}}$. Recall that if $x \sim N_m(0, \Sigma)$, then for each $i \in V$ the distribution of $x_i \mid x_{\prec i \succ}$ is parametrized by $(\Sigma_{ii|\prec i \succ}, \Sigma_{\prec i \succ}^{-1}\Sigma_{\prec i]})$. Note furthermore that these parameters are related to the Cholesky parameterization as follows: $D_{ii} = \Sigma_{ii|\prec i \succ}$ and $L_{\prec i]} = -\Sigma_{\prec i \succ}^{-1}\Sigma_{\prec i]}$ (see HI Proposition 11.1]). The following theorem establishes the strong directed hyper Markov property for the $\pi^{\Theta_{\mathcal{G}}}_{U,\alpha}$ family of DAG Wishart distributions.

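Theorem 6.1. Let $\mathcal{G}$ be an arbitrary DAG and $(D, L) \sim \pi^{\Theta_{\mathcal{G}}}_{U,\alpha}$. Then $\{(D_{ii}, L_{\prec i]}) : i = 1, \ldots, m\}$ are mutually independent. Moreover,

$$D_{ii} \sim IG\left(\frac{\alpha_i}{2} - \frac{pa_i}{2} - 1, \ \frac{1}{2}U_{ii|\prec i \succ}\right), \text{ and} \qquad (6.1)$$

$$L_{\prec i]} \mid D_{ii} \sim N_{pa_i}\left(-U_{\prec i \succ}^{-1}U_{\prec i]}, \ D_{ii}U_{\prec i \succ}^{-1}\right). \qquad (6.2)$$

Proof. The proof is given in the Appendix/Supplemental section. □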

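Theorem 6.1 yields the marginal density of $D$, whereas for the $L$ parameter, only the conditional distribution given $D$ is given. Hence we now proceed to derive the marginal density of $L$.

Corollary 6.1. Let $\mathcal{G}$ be an arbitrary DAG and suppose $(D, L) \sim \pi^{\Theta_{\mathcal{G}}}_{U,\alpha}$. Then the density of $L$ w.r.t. $dL = \prod_{i=1}^m dL_{\prec i]}$ is given by

$$\prod_{i=1}^m c_i \left[\tfrac{1}{2}U_{ii|\prec i \succ} + \tfrac{1}{2}(L_{\prec i]} + U_{\prec i \succ}^{-1}U_{\prec i]})'U_{\prec i \succ}(L_{\prec i]} + U_{\prec i \succ}^{-1}U_{\prec i]})\right]^{-\alpha_i/2 + 1}, \qquad (6.3)$$

where each $c_i$ is given by

$$c_i = \frac{\det(U_{\prec i \succ})^{1/2}\, U_{ii|\prec i \succ}^{\alpha_i/2 - pa_i/2 - 1}\, \Gamma(\alpha_i/2 - 1)}{2^{\alpha_i/2 - 1}\, \pi^{pa_i/2}\, \Gamma(\alpha_i/2 - pa_i/2 - 1)}.$$

In particular, for each $i$, $L_{\prec i]}$ has a multivariate t-distribution:

$$L_{\prec i]} \sim t_{pa_i}\left(-U_{\prec i \succ}^{-1}U_{\prec i]}, \ (\alpha_i/2 - pa_i/2 - 1)U_{ii|\prec i \succ}^{-1}U_{\prec i \succ}, \ \alpha_i - pa_i - 2\right),$$

i.e., $L_{\prec i]}$ has a $pa_i$-variate t-distribution with mean parameter $-U_{\prec i \succ}^{-1}U_{\prec i]}$, scale parameter $(\alpha_i/2 - pa_i/2 - 1)U_{ii|\prec i \succ}^{-1}U_{\prec i \succ}$ and degrees of freedom $\nu_i := \alpha_i - pa_i - 2$.

Proof. The proof is given in the Appendix/Supplemental section. □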
6.2 Alternative method for deriving the DAG Wishart distribution

In light of Theorem 6.1 and following the general approach in [14] we show below that one can arrive at the DAG Wishart distributions $\pi^{\Theta_{\mathcal{G}}}_{U,\alpha}$ in an alternative way. Consider a general setting and suppose that $\{P_\theta : \theta \in \Theta\}$ is a family of directed Markov fields over a DAG $\mathcal{G} = (V, E)$. For each vertex $i \in V$ let $\theta_i$ (or more accurately $\theta_{i|pa(i)}$) be the corresponding parameter of the conditional probability of $x_i \mid x_{pa(i)}$ and let $\Theta_i = \{\theta_i : \theta \in \Theta\}$. Therefore, under a re-parameterization, we may assume that $\Theta = \times_{i=1}^m \Theta_i$. Now suppose that for each vertex $i$, a prior $\pi_i(\theta_i)$ is specified. Under the global independence assumption, i.e., the assumption that the parameters $\{\theta_i : i \in V\}$ are a priori independent random variables, we have $\pi(\theta) = \prod_{i=1}^m \pi_i(\theta_i)$. Note that the global independence assumption is the same as assuming the strong hyper Markov property on the parameter $\theta$ (see [7]).

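Now consider a Gaussian DAG model given as $\{N_m(0, \Sigma) : \Sigma \in \mathrm{PD}_{\mathcal{G}}\}$. Recall that if $x \sim N_m(0, \Sigma)$, then the distribution of $x_i \mid x_{\prec i \succ}$ is parameterized by $(\Sigma_{ii|\prec i \succ}, \Sigma_{\prec i \succ}^{-1}\Sigma_{\prec i]})$. This suggests a re-parameterization of the original family of distributions using the D-parameterization introduced in Section 4.1:

$$\Xi_{\mathcal{G}} = \{\times_{i=1}^m(\Sigma_{ii|\prec i \succ}, \Sigma_{\prec i \succ}^{-1}\Sigma_{\prec i]}) : \Sigma \in \mathrm{PD}_{\mathcal{G}}\}.$$

Since $x_i \mid x_{\prec i \succ} \sim N(\Sigma_{[i \succ}\Sigma_{\prec i \succ}^{-1}x_{\prec i \succ}, \Sigma_{ii|\prec i \succ})$, a natural approach to prior specification for the parameter set $(\Sigma_{ii|\prec i \succ}, \Sigma_{\prec i \succ}^{-1}\Sigma_{\prec i]})$ is the standard conjugate prior, i.e., the inverse-gamma distribution for $\Sigma_{ii|\prec i \succ}$ and a Gaussian distribution for $\Sigma_{\prec i \succ}^{-1}\Sigma_{\prec i]} \mid \Sigma_{ii|\prec i \succ}$. More precisely if

$$\Sigma_{ii|\prec i \succ} \sim IG\left(\frac{\alpha_i}{2} - \frac{pa_i}{2} - 1, \ \frac{1}{2}U_{ii|\prec i \succ}\right), \text{ and}$$

$$\Sigma_{\prec i \succ}^{-1}\Sigma_{\prec i]} \mid \Sigma_{ii|\prec i \succ} \sim N_{pa_i}\left(U_{\prec i \succ}^{-1}U_{\prec i]}, \ \Sigma_{ii|\prec i \succ}U_{\prec i \succ}^{-1}\right),$$

then the global independence assumption implies that the density of $\times_{i=1}^m(\Sigma_{ii|\prec i \succ}, \Sigma_{\prec i \succ}^{-1}\Sigma_{\prec i]})$ is proportional to

$$\exp\left\{-\frac{1}{2}\sum_{i \in V} \Sigma_{ii|\prec i \succ}^{-1}\left[(\Sigma_{\prec i \succ}^{-1}\Sigma_{\prec i]} - U_{\prec i \succ}^{-1}U_{\prec i]})'U_{\prec i \succ}(\Sigma_{\prec i \succ}^{-1}\Sigma_{\prec i]} - U_{\prec i \succ}^{-1}U_{\prec i]}) + U_{ii|\prec i \succ}\right]\right\} \prod_{i \in V} \Sigma_{ii|\prec i \succ}^{-\alpha_i/2}.$$

It is now clear from the proof of Theorem 6.1 that the DAG Wishart distribution defined in Equation (12.2) is nothing but the image of this distribution under the map $\Xi_{\mathcal{G}} \to \Theta_{\mathcal{G}}$.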

Remark 6.1. We note that though the class of DAG Wishart distributions can be derived in 
an equivalent way using the general approach in [7], the latter approach does not however 
immediately give a means to specify hyper-parameters that will exactly correspond to our 
DAG Wishart distributions. Hence it is important to note that the equivalence is easier 
to recognize once hyper Markov properties for our DAG Wishart distributions have been 
established. 
7 Laplace transform, expected value and posterior mode



In this section we compute Laplace transforms, expected values and posterior modes for the class of DAG Wishart distributions defined in this paper.



7.1 Laplace transforms 



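We start with computing the Laplace transform of $\pi^{\Xi_{\mathcal{G}}}_{U,\alpha}$ by exploiting the results established in Theorem 6.1. First a preliminary result on the Laplace transform of a Gaussian-inverse gamma distribution is required.

Lemma 7.1. Suppose $(\lambda, x)$ is a random variable with Gaussian-inverse gamma distribution:

$$x \mid \lambda \sim N_p(\mu, \lambda\Psi), \quad \mu \in \mathbb{R}^p, \ \Psi \in \mathrm{PD}_p(\mathbb{R}); \qquad \lambda \sim IG(\nu, \eta).$$

Then the Laplace transform of $(\lambda, x)$ at $(\xi, u) \in \mathbb{R}_+ \times \mathbb{R}^p$ is

$$\frac{2\exp\{u'\mu\}}{\Gamma(\nu)}\left[\eta\left(\xi - \tfrac{1}{2}u'\Psi u\right)\right]^{\nu/2} K_\nu\left(2\sqrt{\eta\left(\xi - \tfrac{1}{2}u'\Psi u\right)}\right),$$

where $K_\nu(\cdot)$ is the modified Bessel function of the second kind and $\xi - \tfrac{1}{2}u'\Psi u$ is assumed to be positive.

Proof. The proof is given in the Appendix/Supplemental section. □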



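Proposition 7.1. The Laplace transform of $\pi^{\Xi_{\mathcal{G}}}_{U,\alpha}$ at a typical point $\times_{i=1}^m(\xi_i, z_{\prec i]}) \in \Xi_{\mathcal{G}}$ is given by

$$\mathcal{L}_{\Xi_{\mathcal{G}}}(\times_{i=1}^m(\xi_i, z_{\prec i]})) := 2^m \prod_{i=1}^m \frac{\exp\{z_{\prec i]}'\mu_{\prec i]}\}}{\Gamma(\nu_i)}\left[\eta_i\left(\xi_i - \tfrac{1}{2}z_{\prec i]}'\Psi_{\prec i \succ}z_{\prec i]}\right)\right]^{\nu_i/2} K_{\nu_i}\left(2\sqrt{\eta_i\left(\xi_i - \tfrac{1}{2}z_{\prec i]}'\Psi_{\prec i \succ}z_{\prec i]}\right)}\right),$$

where $\nu_i = \frac{\alpha_i}{2} - \frac{pa_i}{2} - 1$, $\eta_i = \frac{1}{2}U_{ii|\prec i \succ}$, $\mu_{\prec i]} = -U_{\prec i \succ}^{-1}U_{\prec i]}$, $\Psi_{\prec i \succ} = U_{\prec i \succ}^{-1}$ and $\xi_i - \tfrac{1}{2}z_{\prec i]}'\Psi_{\prec i \succ}z_{\prec i]}$ are assumed to be positive for each $i$.

Proof. Let $\times_{i=1}^m(\lambda_i, \beta_{\prec i]}) \sim \pi^{\Xi_{\mathcal{G}}}_{U,\alpha}$. Theorem 6.1 implies that the random variables $(\lambda_i, \beta_{\prec i]})$ are independent and each has a Gaussian-inverse gamma distribution as given by Equation (6.1) and Equation (6.2). It therefore suffices to compute the Laplace transform of each random vector $(\lambda_i, \beta_{\prec i]})$ individually. The Laplace transform of $\pi^{\Xi_{\mathcal{G}}}_{U,\alpha}$ now follows immediately from Lemma 7.1. □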

We now proceed to give the Laplace transform of $\pi^{\Theta_{\mathcal{G}}}_{U,\alpha}$.

Corollary 7.1. The Laplace transform of $\pi^{\Theta_{\mathcal{G}}}_{U,\alpha}$ at $(\Lambda, Z) \in \Theta_{\mathcal{G}}$ is given by



n 



1=1 



1 \ / i v r ' 1 



Proof. The proof is given in the Appendix/Supplemental section. □
7.2 Expected values

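We now proceed to compute the expected values of our priors. First some necessary notation is introduced: Suppose $a, b \subseteq V$ and $A \in \mathbb{R}^{a \times b}$ is a matrix of size $|a| \times |b|$. Then define $(A)^0 \in \mathbb{R}^{V \times V}$ by

$$((A)^0)_{ij} = \begin{cases} A_{ij} & i \in a, j \in b, \\ 0 & \text{otherwise.} \end{cases}$$

Furthermore, if $L_{\prec i]}$ is a vector in $\mathbb{R}^{\prec i]}$, then we consider $\begin{pmatrix} 1 \\ L_{\prec i]} \end{pmatrix}$ as a vector in $\mathbb{R}^{\langle i]}$ with 1 in the $i$-th position.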



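Now recall from Corollary 6.1 that $L_{\prec i]}$ has a multivariate t-distribution. This result readily allows us to compute the mean and covariance of the random elements of $L$. They are given as follows:

$$E(L_{\prec i]}) = -U_{\prec i \succ}^{-1}U_{\prec i]} \quad \text{and} \quad \mathrm{Var}(L_{\prec i]}) = \frac{1}{\nu_i - 2}U_{ii|\prec i \succ}U_{\prec i \succ}^{-1},$$

where $\nu_i = \alpha_i - pa_i - 2$. Consequently, if $A := \{i_1, \ldots, i_r\} \subset V$ is the set of vertices $i$ such that $pa(i) \neq \emptyset$, then $E(\times_{i \in A}L_{\prec i]}) = -\times_{i \in A}U_{\prec i \succ}^{-1}U_{\prec i]}$. This can be expressed in matrix form as follows:

$$E[L] = \sum_{i=1}^m \left(\begin{pmatrix} 1 \\ -U_{\prec i \succ}^{-1}U_{\prec i]} \end{pmatrix}\right)^0.$$

The expression for $\mathrm{Var}(\times_{i \in A}L_{\prec i]})$ is given by the block diagonal matrix

$$\mathrm{diag}\left(\frac{1}{\nu_{i_1} - 2}U_{i_1i_1|\prec i_1 \succ}U_{\prec i_1 \succ}^{-1}, \ \ldots, \ \frac{1}{\nu_{i_r} - 2}U_{i_ri_r|\prec i_r \succ}U_{\prec i_r \succ}^{-1}\right).$$

The expected value of $D$ can also be easily computed using the result in Equation (6.1). Under the Cholesky decomposition parameterization we have

$$E[D] = \mathrm{Diag}\left(\frac{U_{ii|\prec i \succ}}{\alpha_i - pa_i - 4} : i \in V\right).$$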



7.3 Posterior modes 

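We now proceed to compute the posterior mode of $\pi^{\Xi_{\mathcal{G}}}_{U,\alpha}$ as this is often a useful quantity in Bayesian inference. The computation of the posterior modes under other parameterizations follows from similar calculations. First let us compute the mode of $\pi^{\Xi_{\mathcal{G}}}_{U,\alpha}$. Recall from Equation (12.4) that the density of $\pi^{\Xi_{\mathcal{G}}}_{U,\alpha}$ is proportional to

$$\exp\left\{-\frac{1}{2}\sum_{i \in V}\lambda_i^{-1}(\beta_{\prec i]} + U_{\prec i \succ}^{-1}U_{\prec i]})'U_{\prec i \succ}(\beta_{\prec i]} + U_{\prec i \succ}^{-1}U_{\prec i]})\right\}\exp\left\{-\frac{1}{2}\sum_{i \in V}\lambda_i^{-1}U_{ii|\prec i \succ}\right\}\prod_{i \in V}\lambda_i^{-\alpha_i/2}.$$

It is clear that for each $\lambda_i$ the factor $\exp\{-\tfrac{1}{2}\lambda_i^{-1}(\beta_{\prec i]} + U_{\prec i \succ}^{-1}U_{\prec i]})'U_{\prec i \succ}(\beta_{\prec i]} + U_{\prec i \succ}^{-1}U_{\prec i]})\}$ is maximized at $\beta_{\prec i]} = -U_{\prec i \succ}^{-1}U_{\prec i]}$. Note also that $\exp\{-\tfrac{1}{2}\lambda_i^{-1}U_{ii|\prec i \succ}\}\lambda_i^{-\alpha_i/2}$ corresponds to the distribution $IG(\alpha_i/2 - 1, U_{ii|\prec i \succ}/2)$ and thus its mode is equal to $U_{ii|\prec i \succ}/\alpha_i$. Combining the above two results the mode of $\pi^{\Xi_{\mathcal{G}}}_{U,\alpha}$ is given by

$$\times_{i=1}^m\left(\frac{U_{ii|\prec i \succ}}{\alpha_i}, \ -U_{\prec i \succ}^{-1}U_{\prec i]}\right).$$

The following result on the posterior mode of $\pi^{\Xi_{\mathcal{G}}}_{U,\alpha}$ now follows immediately from the above calculations.

Proposition 7.2. Let $Y_1, Y_2, \ldots, Y_n$ be i.i.d. observations from a centered normal distribution parametrized by $\Xi_{\mathcal{G}}$ with prior $\pi^{\Xi_{\mathcal{G}}}_{U,\alpha}$, and let $S = \frac{1}{n}\sum_{i=1}^n Y_iY_i'$ be the empirical covariance matrix. From Lemma 5.1 the posterior distribution is equal to $\pi^{\Xi_{\mathcal{G}}}_{nS+U,\alpha+n}$ with posterior mode given as follows:

$$\times_{i=1}^m\left(\frac{(nS + U)_{ii|\prec i \succ}}{\alpha_i + n}, \ -(nS_{\prec i \succ} + U_{\prec i \succ})^{-1}(nS + U)_{\prec i]}\right).$$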
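The mode computation above translates directly into code. The sketch below is a hedged illustration: it follows the sign convention of this subsection (mode of $\beta_{\prec i]}$ at $-U_{\prec i \succ}^{-1}U_{\prec i]}$), assumes 0-based parent-ordered vertices, and uses hypothetical helper names `solve` and `dag_wishart_mode`. The posterior mode of Proposition 7.2 is obtained by calling it with $nS + U$ and $\alpha + n$.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(n):
            if r != c:
                f = aug[r][c] / aug[c][c]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[c])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def dag_wishart_mode(u, alpha, parents):
    """Return the list of pairs (lambda_i, beta_i) at the mode, one per vertex."""
    mode = []
    for i, pa in enumerate(parents):
        u_pa = [[u[r][s] for s in pa] for r in pa]
        u_pa_i = [u[r][i] for r in pa]
        x = solve(u_pa, u_pa_i)                           # U_pa^{-1} U_{pa, i}
        schur = u[i][i] - sum(a * b for a, b in zip(u_pa_i, x))  # U_{ii | pa(i)}
        mode.append((schur / alpha[i], [-v for v in x]))
    return mode
```

For example, with $U = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$, $\alpha = (5, 4)$ and the single edge $2 \to 1$, the first vertex has $U_{11|\prec 1 \succ} = 1.5$, so its mode is $(1.5/5, -0.5)$.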

PART I: DAG Wishart densities for perfect DAGs 



8 Induced priors on $P_{\mathcal{G}}$ and $Q_{\mathcal{G}}$ for perfect DAGs

The prior $\pi^{\Theta_{\mathcal{G}}}_{U,\alpha}$ on $\Theta_{\mathcal{G}}$ (the modified Cholesky space) induces a prior on $P_{\mathcal{G}}$. When $\mathcal{G}$ is a perfect DAG, the induced prior on $P_{\mathcal{G}}$ can be evaluated in a relatively straightforward manner. Recall from §4.3 that $P_{\mathcal{G}}$ is the space of positive definite matrices $\Omega$ s.t. $\Omega^{-1} \in \mathrm{PD}_{\mathcal{G}}$. As pointed out earlier in §4.3, when $\mathcal{G}$ is a perfect DAG $P_{\mathcal{G}}$ corresponds to the space of positive definite matrices with zero restrictions according to the decomposable graph $\mathcal{G}^u$. We now provide an expression for the induced prior on $P_{\mathcal{G}}$ to enable comparisons between our DAG Wishart distributions and other classes of distributions that have been introduced in the literature. Note that when $\mathcal{G}$ is a perfect DAG the bijective mapping

$$\psi := ((D, L) \mapsto LD^{-1}L') : \Theta_{\mathcal{G}} \to P_{\mathcal{G}} \qquad (8.1)$$

is a diffeomorphism between two open subsets of Euclidean space. The lemma below provides the Jacobian required for deriving the induced priors on $P_{\mathcal{G}}$. Note that for $a \subseteq V$ the number of elements of $a$ is denoted by $|a|$.
Lemma 8.1. Let $\mathcal{G}$ be a perfect DAG. Then the Jacobian of the mapping $\psi = ((D, L) \mapsto LD^{-1}L') : \Theta_{\mathcal{G}} \to P_{\mathcal{G}}$ is equal to

$$\prod_{j=1}^m D_{jj}^{-(pa_j + 2)}. \qquad (8.2)$$

A variant of the proof of this lemma can be found in [27, 9]. We nevertheless give a proof in the Appendix for completeness and because the mapping under consideration is slightly different. Furthermore the notation in the proof is important for subsequent sections.

Lemma 8.1 allows us to write the density of the induced prior on $P_{\mathcal{G}}$ as follows:

$$\pi^{P_{\mathcal{G}}}_{U,\alpha}(\Omega) := z_{\mathcal{G}}(U, \alpha)^{-1}\exp\{-\tfrac{1}{2}\mathrm{tr}(\Omega U)\}\prod_{i=1}^m D_{ii}(\Omega)^{-\frac{1}{2}\alpha_i + pa_i + 2}. \qquad (8.3)$$

Here $D_{ii} = (\Omega^{-1})_{ii|\prec i \succ}$ is considered as a function of $\Omega$.
Recall that if $\mathcal{G}$ is a perfect DAG, then $\mathcal{G}^u$ is decomposable and the sets $P_{\mathcal{G}}$ and $P_{\mathcal{G}^u}$ are identical. So it is natural to ask whether the distributions we define on $P_{\mathcal{G}}$ are comparable with other distributions in the literature that are defined on the same space in the decomposable undirected graph setting. We first note that the traditional or classical Wishart distribution [16] on $\mathrm{PD}_m(\mathbb{R})$ is a special case of $\pi^{P_{\mathcal{G}}}_{U,\alpha}$. In particular the standard Wishart distribution with scale parameter $U$ and degrees of freedom $n$ is a special case of $\pi^{P_{\mathcal{G}}}_{U,\alpha}$ when $\mathcal{G}$ is a perfect DAG with $pa(i) = \{i+1, \ldots, m\}$, and $\alpha_i = n + m - 2i + 3$, $\forall\, 1 \leq i \leq m$. In this sense we can regard $\pi^{P_{\mathcal{G}}}_{U,\alpha}$ as a generalization of the classical Wishart distribution. We also note that the $G$-Wishart distribution introduced in ED for undirected graphs, that is, the inverse of the hyper-inverse Wishart, which has a one-dimensional shape parameter $\delta$, is also a special case of the richer class $\pi^{P_{\mathcal{G}}}_{U,\alpha}$. The single shape parameter $\delta$ for the $G$-Wishart is related to the $\alpha_i$ as follows: $\alpha_i = \delta + 2pa_i + 2$, $1 \leq i \leq m$.

In the decomposable graph setting, a more general family of distributions on $P_{\mathcal{G}}$ are the so-called type II Wishart distributions, denoted by $W_{P_{\mathcal{G}}}$, introduced in the seminal work of Letac and Massam in [15], and later successfully used by Rajaratnam et al. [19] for high dimensional Bayesian inference for undirected graphs. Recall that perfect DAG models and decomposable undirected graphical models are Markov equivalent, and therefore the corresponding parameter spaces are the same. Hence a careful comparison between the DAG Wishart distributions introduced in this paper and the $W_{P_{\mathcal{G}}}$ Wishart distributions introduced in [15] is very important. The multiple shape parameter for the $W_{P_{\mathcal{G}}}$ family belongs to a set $\mathcal{S}$, and is fully known only when the graph is homogeneous. In particular the set $\mathcal{S}$ in which the shape parameters lie is not fully characterized when $\mathcal{G}$ is decomposable. In [15] the authors however show that for any perfect ordering $P$ of a given decomposable graph, there exists a well-describable set $B_P \subset \mathcal{S}$ over which the corresponding normalizing constants are finite and independent of the scale parameter. Moreover, they conjecture that $\mathcal{S}$ is indeed the union of $B_P$ over all perfect orderings of the cliques of the graph $\mathcal{G}$. In an interesting and more involved development (see [4]) we show that for any non-homogeneous decomposable graph $\mathcal{G}$ there is a perfect ordering $P$ and a perfect DAG, $\mathcal{G}$, associated with this ordering such that the Wishart distribution $W_{P_{\mathcal{G}}}$ on $B_P$ is a special case of $\pi^{P_{\mathcal{G}}}_{U,\alpha}$ for a specific choice of $\alpha$. Under this observation we prove in [4] that the Letac-Massam conjecture is in general not true. When the graph $\mathcal{H}$ is homogeneous, we also show that there exists a unique perfect DAG version $\mathcal{G}$ of $\mathcal{H}$ such that $W_{P_{\mathcal{H}}}$ is a special case of our distribution on $P_{\mathcal{G}}$. We emphasize here that the analysis of DAGs undertaken in this paper was the primary key to resolving the aforementioned conjecture, which in fact was formulated for undirected (or concentration) graph models. The results in [4], which prove that the $W_{P_{\mathcal{G}}}$ introduced in [15] are a special case of $\pi^{P_{\mathcal{G}}}_{U,\alpha}$, are not the real focus here; they are rather involved and a subject of interest in their own right. We simply mention these technical mathematical results for comparison purposes as a detailed study is beyond the scope of this paper. Our focus in this paper is rather to study the properties of our DAG Wishart distributions with the specific goal of using them for high dimensional Bayesian inference for DAGs.

Conventionally, by the Jacobian of a mapping we mean the absolute value of the determinant of the Jacobian matrix.

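Another question of interest is the functional form of the induced prior on the space $Q_{\mathcal{G}}$. A preliminary result is required in order to determine this image measure. Suppose $\mathcal{G}$ is a perfect DAG and $\mathcal{G}^u = (V, E^u)$ is the undirected version of $\mathcal{G}$. Grone et al. [8] prove that for any incomplete matrix $\Gamma$ in $Q_{\mathcal{G}}$ there exists a unique $\Sigma(\Gamma)$ in $\mathrm{PD}_{\mathcal{G}}$ such that $\Sigma^E = \Gamma$, where $\Sigma^E$ is the image of $\Sigma$ under the projection mapping $\mathrm{proj}_{\mathcal{G}} : S_m(\mathbb{R}) \to I_{\mathcal{G}}$. This defines an isomorphism between $Q_{\mathcal{G}}$ and $P_{\mathcal{G}}$ via:

$$\varphi := (\Gamma \mapsto \Sigma(\Gamma)^{-1}) : Q_{\mathcal{G}} \to P_{\mathcal{G}},$$

$$\varphi^{-1} = (\Omega \mapsto (\Omega^{-1})^E) : P_{\mathcal{G}} \to Q_{\mathcal{G}}.$$

The matrix $\Sigma(\Gamma)$ is said to be the (positive definite) completion of $\Gamma$ in $\mathrm{PD}_{\mathcal{G}}$. The Jacobian of the mapping $\Gamma \mapsto \Sigma(\Gamma)^{-1}$, given in [21], is as follows: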

ir s ] (l ' v|+1)vW 

Consequently, the induced prior on $Q_{\mathcal{G}}$ is given as

agyO - exp{--tr(E(D- 1 U)) n *f l s U D u (Yr^>^ 2 , Y 6 Q^. 

Evidently, since $\pi^{P_{\mathcal{G}}}_{U,\alpha}$ is a generalization of the classical Wishart distribution and the $G$-Wishart, $\pi^{Q_{\mathcal{G}}}_{U,\alpha}$ above is a generalization of the inverse Wishart distribution and the hyper inverse Wishart (HIW). Furthermore, since the $W_{P_{\mathcal{G}}}$ family of Letac-Massam [15] is a special case of $\pi^{P_{\mathcal{G}}}_{U,\alpha}$, the inverse of $W_{P_{\mathcal{G}}}$, denoted by $IW_{P_{\mathcal{G}}}$, is also a special case of $\pi^{Q_{\mathcal{G}}}_{U,\alpha}$ above.



furthermore, it also defines a diffeomorphism. 
8.1 Closed form expressions for perfect DAGs

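We now provide closed form expressions for expected values of $\Omega$ and the incomplete positive definite (random) matrix $\Gamma \in Q_{\mathcal{G}}$, when $\mathcal{G}$ is perfect. The main reason for restricting $\mathcal{G}$ to perfect DAGs is that in this case $P_{\mathcal{G}}$ and $Q_{\mathcal{G}}$ are open subsets of the Euclidean space $\mathbb{R}^{|E|}$ and therefore integrations w.r.t. Lebesgue measure are meaningful and the expected values are indeed defined within the parameter spaces.

First note that when $\mathcal{G}$ is perfect the Laplace transform of $\pi^{P_{\mathcal{G}}}_{U,\alpha}$ at $K \in S_m(\mathbb{R})$ is given by

$$\mathcal{L}_{P_{\mathcal{G}}}(K) = \int \exp\{-\mathrm{tr}(K\Omega)\}\pi^{P_{\mathcal{G}}}_{U,\alpha}(\Omega)\,d\Omega = \frac{1}{z_{\mathcal{G}}(U, \alpha)}\int \exp\{-\tfrac{1}{2}\mathrm{tr}((2K + U)\Omega)\}\prod_{i=1}^m D_{ii}^{-\frac{1}{2}\alpha_i + pa_i + 2}\,d\Omega = \frac{z_{\mathcal{G}}(2K + U, \alpha)}{z_{\mathcal{G}}(U, \alpha)}.$$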
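The following proposition gives the expected value of $\Omega \sim \pi^{P_{\mathcal{G}}}_{U,\alpha}$ when $\mathcal{G}$ is perfect.

Proposition 8.1. Suppose $\mathcal{G}$ is perfect and $\Omega \sim \pi^{P_{\mathcal{G}}}_{U,\alpha}$ with $\alpha_i > pa_i + 2$, then

$$E[\Omega] = \sum_{i=1}^m (\alpha_i - pa_i - 2)\left(U_{\langle i \rangle}^{-1}\right)^0 - \sum_{i=1}^m (\alpha_i - pa_i - 3)\left(U_{\prec i \succ}^{-1}\right)^0.$$

Proof. The proof is given in the Appendix/Supplemental section. □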

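Our next goal is to determine the expected value of $\Gamma \sim \pi^{Q_{\mathcal{G}}}_{U,\alpha}$. Note that in essence we can identify $\Gamma \in Q_{\mathcal{G}}$ with its positive definite completion $\Sigma = \Sigma(\Gamma)$. Under this consideration we show that by a recursive algorithm one can calculate the expected value of $\Sigma$.

Proposition 8.2. Let $\mathcal{G}$ be a perfect DAG and $\Gamma \sim \pi^{Q_{\mathcal{G}}}_{U,\alpha}$, with $\alpha_i > pa_i + 4$. Then the expected value of $\Gamma$ can be recursively computed in the following steps:

(i) $E[\Sigma_{mm}] = \dfrac{U_{mm}}{\alpha_m - 4}$,

(ii) $E[\Sigma_{\prec i]}] = -E[\Sigma_{\prec i \succ}]E[L_{\prec i]}] = E[\Sigma_{\prec i \succ}]U_{\prec i \succ}^{-1}U_{\prec i]}$,

(iii) $E(\Sigma_{ii}) = \dfrac{U_{ii|\prec i \succ}}{\alpha_i - pa_i - 4} + \mathrm{tr}\left(E[\Sigma_{\prec i \succ}]\left(\dfrac{U_{ii|\prec i \succ}U_{\prec i \succ}^{-1}}{\alpha_i - pa_i - 4} + U_{\prec i \succ}^{-1}U_{\prec i]}U_{[i \succ}U_{\prec i \succ}^{-1}\right)\right)$, $\quad i = m-1, \ldots, 1$.

Proof. The proof is given in the Appendix/Supplemental section. □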

Remark 8.1. We note that the recursive expressions in Proposition 8.2 are very similar to the expressions for the expected covariance matrix under the covariance Wishart priors introduced in [9, Corollary 4] for homogeneous covariance graph models. The Markov equivalence of covariance graph models and DAG models for $\mathcal{G}$ homogeneous explains this similarity. We note however that Proposition 8.2 is valid for all perfect graphs in the DAG setting, and is not confined to the restrictive class of homogeneous graphs as in the covariance graph setting.

PART II: Hausdorff DAG Wishart densities on curved manifolds 
9 Density of $\pi^{P_{\mathcal{G}}}_{U,\alpha}$ w.r.t. Hausdorff measure for an arbitrary DAG

In this section we generalize the prior $\pi^{P_{\mathcal{G}}}_{U,\alpha}$, obtained for a perfect DAG $\mathcal{G}$, to an arbitrary DAG. Recall that $\pi^{P_{\mathcal{G}}}_{U,\alpha}$, the image of the DAG Wishart $\pi^{\Theta_{\mathcal{G}}}_{U,\alpha}$, is a distribution on $P_{\mathcal{G}}$, the space of concentration matrices in the context of Gaussian DAG models. When $\mathcal{G}$ is a perfect DAG the space $P_{\mathcal{G}}$ is an open subset of $Z_{\mathcal{G}} \cong \mathbb{R}^{|E|}$ and therefore $\pi^{P_{\mathcal{G}}}_{U,\alpha}$ has a density w.r.t. Lebesgue measure on $\mathbb{R}^{|E|}$. The functional form of this density was derived in Equation (8.3). When $\mathcal{G}$ is no longer a perfect DAG several complications arise. In particular, the space $P_{\mathcal{G}}$ has Lebesgue measure zero in any Euclidean vector space containing it. This implies that $\pi^{P_{\mathcal{G}}}_{U,\alpha}$ does not have a density w.r.t. Lebesgue measure. In theory a solution to this problem requires deriving the density of $\pi^{P_{\mathcal{G}}}_{U,\alpha}$ w.r.t. Hausdorff measure. This section elaborates on this topic in detail.

9.1 Lebesgue measure of $P_{\mathcal G}$

In this section we undertake a measure theoretic analysis of the space $P_{\mathcal G}$ when $\mathcal{G}$ is not a
perfect DAG. First note that Lemma 2.1 implies the following: $P_{\mathcal G} \subset P_{\mathcal{G}^m} \subset Z_{\mathcal{G}^m}$. Now let
$\mathcal{G} = (V, E)$ be a non-perfect DAG, then $P_{\mathcal G}$ has Lebesgue measure zero in any Euclidean
vector space containing it. The next lemma gives a formal proof of this assertion.

Lemma 9.1. Suppose $\mathcal{G} = (V, E)$ is a non-perfect DAG and $\mathcal{V}$ is a Euclidean space containing
$P_{\mathcal G}$. Then $\mathcal{V}$ contains $Z_{\mathcal{G}^m}$. Moreover, $P_{\mathcal G}$ has Lebesgue measure zero in $\mathcal{V}$.

Proof. For each $(i, j) \in E^m$ with $j \le i$ let us define the elementary symmetric matrix
$E^{(i,j)} \in S_m(\mathbb{R})$ as follows:

$$E^{(i,j)}_{uv} = \begin{cases} 1 & \text{if } \{u,v\} = \{i,j\}, \\ 0 & \text{otherwise.} \end{cases}$$

Note that the set of $E^{(i,j)}$ forms a basis of $Z_{\mathcal{G}^m}$. It is clear that $\mathcal{V}$ contains $Z_{\mathcal G} \supset \{E^{(i,j)} : (i, j) \in E\}$.
Hence it suffices to prove that $\mathcal{V}$ contains the rest of the $E^{(i,j)}$. Now let $(i, j)$ be in $E^m \setminus E$
with $i > j$. This implies that there exists $k < j < i$ such that $i \to k \leftarrow j$. We define the lower
triangular matrix $L^{(i,j)} \in \mathcal{L}_{\mathcal G}$ as follows:

$$L^{(i,j)}_{uv} = \begin{cases} 1 & \text{if } (u, v) = (i, k), \\ 1 & \text{if } (u, v) = (j, k), \\ 1 & \text{if } u = v, \\ 0 & \text{otherwise.} \end{cases}$$

Then one can easily check that $P_{\mathcal G} \ni L^{(i,j)}(L^{(i,j)})^t = T + E^{(i,j)}$, for some $T \in Z_{\mathcal G}$. This shows
that $E^{(i,j)} \in \mathcal{V}$. Hence $P_{\mathcal G} \subset \mathcal{V} \Rightarrow Z_{\mathcal{G}^m} \subset \mathcal{V}$, thus $P_{\mathcal G} \subset Z_{\mathcal{G}^m} \subset \mathcal{V}$.

Now note that $P_{\mathcal G}$ is a manifold of dimension $|E|$ diffeomorphic to $\Theta_{\mathcal G}$, which in turn is
an open subset of Euclidean space of dimension $|E|$. Furthermore, recall that the dimension
of $Z_{\mathcal{G}^m}$ is $|E^m|$ and is therefore strictly larger than $|E|$. So any Euclidean space that
contains $P_{\mathcal G}$ has dimension strictly larger than $|E|$. Hence $P_{\mathcal G}$ has Lebesgue measure zero in
any Euclidean vector space containing it. □

Consequently, Lemma 9.1 implies that if $\mathcal{G}$ is non-perfect then $\pi^{P_{\mathcal G}}_{U,\alpha}$ has no density w.r.t.
Lebesgue measure.

9.2 The density of $\pi^{P_{\mathcal G}}_{U,\alpha}$ w.r.t. Hausdorff measure

We now proceed to derive the density of $\pi^{P_{\mathcal G}}_{U,\alpha}$ w.r.t. Hausdorff measure$^{10}$. Let $A_{\mathcal G}$ denote the
set of $(D, L)$ such that $D \in \mathbb{R}^{m \times m}$ is a diagonal matrix and $L \in \mathcal{L}_{\mathcal G}$. It is immediate that $A_{\mathcal G}$
is a real linear space of dimension $|E|$ with the following scalar product and sum operation,
respectively.

1. $\lambda(D, L) := (\lambda D, \lambda L)$, $\forall \lambda \in \mathbb{R}$;

2. $(D', L') + (D'', L'') = (D, L)$, where $D = D' + D''$, and $L$ is a lower triangular matrix
with $L_{ij} = L'_{ij} + L''_{ij}$ if $i \neq j$ and $L_{ii} = 1$.

One can easily check that $\Theta_{\mathcal G}$ is an open subset of $A_{\mathcal G}$. Now since $P_{\mathcal G}$ is a subset of the Euclidean
space $Z_{\mathcal{G}^m}$, the mapping $\psi : \Theta_{\mathcal G} \to Z_{\mathcal{G}^m}$ satisfies the conditions of Theorem 19.3 in [2].
Hence we can proceed to obtain the density of $\pi^{P_{\mathcal G}}_{U,\alpha}$ w.r.t. the $|E|$-dimensional Hausdorff
measure on $Z_{\mathcal{G}^m}$. To obtain an explicit expression for $J\psi(D, L)$ we first need to compute
the matrix of partial derivatives $\partial\psi_{kl}/\partial D_{ii}$ and $\partial\psi_{kl}/\partial L_{ij}$. We order the coordinates of $A_{\mathcal G}$ as follows:
$D_{11}$, $L_{21}$ if $(2, 1) \in E$, $D_{22}$, $L_{31}$ if $(3, 1) \in E$, $L_{32}$ if $(3, 2) \in E$, $\ldots$, $D_{(m-1)(m-1)}$, $L_{ml}$ for $l =
1, \ldots, m-1$ if $(m, l) \in E$, $D_{mm}$. Likewise, we order the coordinates of $Z_{\mathcal{G}^m} = \mathbb{R}^{|E|} \times \mathbb{R}^{|\mathcal{J}|}$,
where $\mathcal{J} := E^m \setminus E$, by ordering first the positions $(k, l) \in E$ as above, in their entirety, and
then the positions $(k, l) \in \mathcal{J}$ according to their lexicographical order. Note that
the latter positions correspond to immoralities. These partial derivatives can be computed
as follows:

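$$\frac{\partial (LD^{-1}L^t)_{kl}}{\partial D_{ii}} = -D_{ii}^{-2} L_{ki} L_{li}, \qquad (9.1)$$

$$\frac{\partial (LD^{-1}L^t)_{kl}}{\partial L_{ij}} = \delta_{ik} D_{jj}^{-1} L_{lj} + \delta_{il} D_{jj}^{-1} L_{kj}, \qquad (9.2)$$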

where $\delta_{uv}$ is the Kronecker delta function. Using Equations (9.1) and (9.2) we partition the
Jacobian matrix $D\psi(D, L)$, considered as a mapping from $\mathbb{R}^{|E|}$ to $\mathbb{R}^{|E|} \times \mathbb{R}^{|\mathcal{J}|}$, into two blocks
of matrices $A := D\psi(D, L)_{EE}$ of size $|E| \times |E|$ and $C := D\psi(D, L)_{\mathcal{J}E}$ of size $|\mathcal{J}| \times |E|$,
respectively. The matrix $A$ is the same as the Jacobian matrix from Lemma 8.1 and $C$ is
the last $|\mathcal{J}|$ rows of the Jacobian matrix $D\psi(D, L)$, with each row of $C$ being the partial
derivatives obtained by Equations (9.1) and (9.2) for $(k, l) \in \mathcal{J}$ and $(i, j) \in E$. Finally, we
can calculate the Jacobian of $\psi$ as follows:

$$J\psi(D, L) = \sqrt{\det\!\left( \begin{pmatrix} A \\ C \end{pmatrix}^{t} \begin{pmatrix} A \\ C \end{pmatrix} \right)}
= |\det(A)| \sqrt{\det(I + A^{-t} C^{t} C A^{-1})}
= \prod_{i=1}^{m} D_{ii}^{-(pa_i + 2)} \sqrt{\det(I + A^{-t} C^{t} C A^{-1})}.$$

$^{10}$ The reader is referred to [2, Section 19] for more details on this topic.

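Therefore we have proved the following.

Theorem 9.1. Let $A$ and $C$ be defined as the block matrices in the partitioning of the (Hausdorff)
Jacobian matrix of $\psi$ above. Then the density of $\pi^{P_{\mathcal G}}_{U,\alpha}$ w.r.t. Hausdorff measure on $Z_{\mathcal{G}^m}$
is given by

$$z_{\mathcal G}(U, \alpha)^{-1} \exp\{-\tfrac{1}{2}\mathrm{tr}(\Omega U)\} \prod_{i=1}^{m} D_{ii}^{-\frac{1}{2}\alpha_i + pa_i + 2} \det(I + A^{-t} C^{t} C A^{-1})^{-\frac{1}{2}}. \qquad (9.3)$$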




Figure 1: Wishart density w.r.t. a Hausdorff measure 

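Example 9.1. Consider the DAG $\mathcal{G}$ given in Figure 1. The Jacobian matrix corresponding to
Equation (9.1) and Equation (9.2), with rows ordered as $\Omega_{11}, \Omega_{21}, \Omega_{22}, \Omega_{31}, \Omega_{33}, \Omega_{32}$ and columns ordered as $D_{11}, L_{21}, D_{22}, L_{31}, D_{33}$, is given as follows:

$$M = \begin{pmatrix}
-D_{11}^{-2} & 0 & 0 & 0 & 0 \\
-L_{21} D_{11}^{-2} & D_{11}^{-1} & 0 & 0 & 0 \\
-L_{21}^{2} D_{11}^{-2} & 2 L_{21} D_{11}^{-1} & -D_{22}^{-2} & 0 & 0 \\
-L_{31} D_{11}^{-2} & 0 & 0 & D_{11}^{-1} & 0 \\
-L_{31}^{2} D_{11}^{-2} & 0 & 0 & 2 L_{31} D_{11}^{-1} & -D_{33}^{-2} \\
-L_{21} L_{31} D_{11}^{-2} & L_{31} D_{11}^{-1} & 0 & L_{21} D_{11}^{-1} & 0
\end{pmatrix}.$$

By computing $\det(M^{t} M)$ we obtain

$$J\psi(D, L) = D_{11}^{-4} D_{22}^{-2} D_{33}^{-2} \left( L_{21}^{2} L_{31}^{2} + L_{21}^{2} + L_{31}^{2} + 1 \right)^{1/2}.$$

Thus the density of $\pi^{P_{\mathcal G}}_{U,\alpha}$ w.r.t. $\mathcal{H}^{5}$ on $\mathbb{R}^{6}$ is given by

$$z_{\mathcal G}(U, \alpha)^{-1} \exp\{-\tfrac{1}{2}\mathrm{tr}(\Omega U)\}\, D_{11}^{-\frac{1}{2}\alpha_1 + 4} D_{22}^{-\frac{1}{2}\alpha_2 + 2} D_{33}^{-\frac{1}{2}\alpha_3 + 2} \left( L_{21}^{2} L_{31}^{2} + L_{21}^{2} + L_{31}^{2} + 1 \right)^{-1/2},$$

where $D_{ii}$ and $L_{ij}$ are considered as functions of $\Omega$.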
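The Jacobian in Example 9.1 can be checked numerically. The sketch below is an illustration only: it assumes the three-vertex DAG $2 \to 1 \leftarrow 3$ (which is what the exponents in the example imply), builds the $6 \times 5$ matrix of partial derivatives directly from Equations (9.1) and (9.2), and compares $\det(M^t M)$ with $\prod_i D_{ii}^{-2(pa_i+2)}\,(1+L_{21}^2)(1+L_{31}^2)$. The helper names `partial`, `det` and `gram_det` are hypothetical, and exact rational arithmetic is used so that the comparison is an equality rather than an approximation.

```python
from fractions import Fraction


def det(a):
    """Determinant by exact Gaussian elimination on a square matrix."""
    a = [list(map(Fraction, row)) for row in a]
    n, d = len(a), Fraction(1)
    for c in range(n):
        p = next((r for r in range(c, n) if a[r][c] != 0), None)
        if p is None:
            return Fraction(0)
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            a[r] = [vr - f * vc for vr, vc in zip(a[r], a[c])]
    return d


def partial(k, l, coord, L, D):
    """Partial derivative of (L D^{-1} L^t)_{kl}, Equations (9.1) and (9.2)."""
    if coord[0] == "D":                      # derivative w.r.t. D_ii
        i = coord[1]
        return -L[k][i] * L[l][i] / D[i] ** 2
    _, i, j = coord                          # derivative w.r.t. L_ij
    return ((k == i) * L[l][j] + (l == i) * L[k][j]) / D[j]


def gram_det(D, l21, l31):
    """det(M^t M) for the assumed DAG 2 -> 1 <- 3 (0-based indices inside)."""
    L = [[Fraction(1), 0, 0], [l21, Fraction(1), 0], [l31, 0, Fraction(1)]]
    cols = [("D", 0), ("L", 1, 0), ("D", 1), ("L", 2, 0), ("D", 2)]
    rows = [(0, 0), (1, 0), (1, 1), (2, 0), (2, 2), (2, 1)]  # immorality last
    M = [[partial(k, l, c, L, D) for c in cols] for (k, l) in rows]
    MtM = [[sum(M[r][a] * M[r][b] for r in range(6)) for b in range(5)]
           for a in range(5)]
    return det(MtM)


D = [Fraction(2), Fraction(3), Fraction(5)]
l21, l31 = Fraction(1, 2), Fraction(-3, 4)
lhs = gram_det(D, l21, l31)
rhs = (D[0] ** -4 * D[1] ** -2 * D[2] ** -2) ** 2 * (1 + l21 ** 2) * (1 + l31 ** 2)
print(lhs == rhs)
```

The factor $(1+L_{21}^2)(1+L_{31}^2)$ is just the expanded form $L_{21}^2L_{31}^2 + L_{21}^2 + L_{31}^2 + 1$; squaring the right-hand side reflects that $J\psi = \sqrt{\det(M^tM)}$.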

PART III: DAG Wishart densities on incomplete covariance spaces

10 The DAG Wishart distribution on the space of incomplete
concentration matrices

The previous section demonstrated the difficulty of working with $\pi^{P_{\mathcal G}}_{U,\alpha}$ for a general DAG $\mathcal{G}$:
first, the density does not exist w.r.t. Lebesgue measure but only w.r.t. Hausdorff measure,
and secondly, even w.r.t. Hausdorff measure the density of $\pi^{P_{\mathcal G}}_{U,\alpha}$ is not easily computable. It
is not immediately clear how to overcome this obstacle. We remind the reader that this
problem does not occur when $\mathcal{G}$ is restricted to the perfect/decomposable case as treated in the
work of [15, 9, 7, 19]. To avoid the complexity inherent in the nature of $P_{\mathcal G}$, we propose an
approach that works with only the functionally independent elements of $P_{\mathcal G}$ and
demonstrate that this can lead to fruitful results. More precisely, we consider the projection
of $P_{\mathcal G}$ onto the space of incomplete symmetric matrices where the specified entries are in
positions determined by the edge set of $\mathcal{G}$, i.e., along the edges. To this end we shall
first demonstrate that $P_{\mathcal G}$ can be easily identified with a new space $R_{\mathcal G}$ through a simple
isomorphism, where the latter is Euclidean from a topological perspective and on which
the standard Lebesgue measure is defined.

10.1 The space of incomplete concentration matrices

First recall the definitions of the sets $Z_{\mathcal G}$ and $I_{\mathcal G}$ from §4.3 and that $P_{\mathcal G}$ is the space of
concentration matrices corresponding to the Gaussian DAG model $\mathcal{N}(\mathcal{G})$. There is a natural
injection $(\Upsilon \mapsto (\Upsilon)^0) : I_{\mathcal G} \to Z_{\mathcal G}$, where $(\Upsilon)^0$, as defined earlier, "fills" or "completes" the
unspecified positions with zeros to obtain a full matrix in $\mathbb{R}^{V \times V}$. Note that for each clique
$c$ of $\mathcal{G}$ the restriction of $\Upsilon$ on $c$, denoted by $\Upsilon_c$, is a full matrix and, moreover, $\Upsilon$ is uniquely
determined by the blocks of matrices $(\Upsilon_c : c \in \mathcal{C}_{\mathcal G})$.

Definition 10.1. Suppose $\mathcal{K} \subset \mathbb{R}^{m \times m}$ and $\Upsilon$ is a $\mathcal{G}$-incomplete matrix in $I_{\mathcal G}$, then we say $\Upsilon$
can be completed in $\mathcal{K}$ if there exists a matrix $A \in \mathcal{K}$ such that $A_c = \Upsilon_c$ for each clique $c$ of $\mathcal{G}$.
Remark 10.1. If $\mathcal{K}$ is the set of positive definite matrices $PD_m(\mathbb{R})$ then the completion
defined above reduces to the standard definition of positive definite completion [8]. We note
however that in what ensues below, the positive definite completion refers to completion
of partially positive concentration and covariance matrices that correspond to DAGs, as opposed to
those that correspond to undirected graphs as in [8]. Note also that a necessary condition
for a positive definite completion of an incomplete matrix is that it belongs to $Q_{\mathcal G}$. This
requirement follows from the fact that the principal submatrices of positive
definite matrices are positive definite. Thus we shall henceforth only focus on completion of
partially positive definite matrices over $\mathcal{G}$, i.e., those elements in $Q_{\mathcal G}$, as compared to the
larger class $I_{\mathcal G}$.

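Proposition 10.1. Let $\Upsilon$ be a $\mathcal{G}$-incomplete matrix in $I_{\mathcal G}$. Then:

1. Almost everywhere (w.r.t. Lebesgue measure on $I_{\mathcal G}$), there exist a lower triangular
matrix $L \in \mathcal{L}_{\mathcal G}$ and a diagonal matrix $\Lambda \in \mathbb{R}^{m \times m}$ such that $\hat{\Upsilon} := L \Lambda L^t$ is a completion
of $\Upsilon$.

2. The completion algorithm to construct $\Lambda$ and $L$ is given as follows:

i) Set $L_{ij} = 0$ for each $(i, j) \notin E$.

ii) Set $\Lambda_{11} = \Upsilon_{11}$, $L_{i1} = \Lambda_{11}^{-1} \Upsilon_{i1}$ for each $i \in pa(1)$ and set $j = 1$.

iii) If $j < m$, then set $j = j + 1$ and proceed to step iv), otherwise $L$ and $\Lambda$ are
constructed such that they satisfy the condition in part 1.

iv) Set $\Lambda_{jj} = \Upsilon_{jj} - \sum_{k=1}^{j-1} \Lambda_{kk} L_{jk}^{2}$. If $\Lambda_{jj} \neq 0$, then set $L_{ij} = \Lambda_{jj}^{-1}\big(\Upsilon_{ij} - \sum_{k=1}^{j-1} \Lambda_{kk} L_{ik} L_{jk}\big)$ for each $i \in pa(j)$ and return
to step iii). If $\Lambda_{jj} = 0$, then no completion of $\Upsilon$ exists that satisfies the condition
in part 1. Consequently, $\Upsilon$ cannot be completed in $P_{\mathcal G}$ either.

3. The matrix $\hat{\Upsilon}$ is the unique positive definite completion of $\Upsilon$ in $P_{\mathcal G}$ iff the diagonal
entries of $\Lambda$ are all strictly positive.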

Proof. The proof is found in 0. □ 

Remark 10.2. Note first that in Proposition 10.1 the algorithm itself determines if $\Upsilon$ can
be completed in $P_{\mathcal G}$. Furthermore, the process described in Proposition 10.1 succeeds in
completing $\Upsilon$ in the space of symmetric matrices $S_m(\mathbb{R})$ as long as $\Lambda_{jj} \neq 0$. However, the
completion is in $P_{\mathcal G}$ iff the $\Lambda_{jj}$ are all strictly positive.

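Example 10.1. Let $\mathcal{G}$ be the DAG given by Figure 2.

Figure 2: Completion in $P_{\mathcal G}$

(a) Let $\Upsilon_1$ be the $\mathcal{G}$-incomplete matrix

$$\Upsilon_1 = \begin{pmatrix} 4 & 8 & 8 & ? \\ 8 & 19 & ? & 9 \\ 8 & ? & 18 & 6 \\ ? & 9 & 6 & 44 \end{pmatrix}.$$

Applying the algorithm described in Proposition 10.1 we obtain:

$$L = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 2 & 1 & 0 & 0 \\ 2 & 0 & 1 & 0 \\ 0 & 3 & 3 & 1 \end{pmatrix}, \quad
\Lambda = \begin{pmatrix} 4 & 0 & 0 & 0 \\ 0 & 3 & 0 & 0 \\ 0 & 0 & 2 & 0 \\ 0 & 0 & 0 & -1 \end{pmatrix}, \quad
\hat{\Upsilon}_1 = \begin{pmatrix} 4 & 8 & 8 & 0 \\ 8 & 19 & 16 & 9 \\ 8 & 16 & 18 & 6 \\ 0 & 9 & 6 & 44 \end{pmatrix}.$$

The negative element in $\Lambda$ demonstrates that $\Upsilon_1$ cannot be completed in $P_{\mathcal G}$.

(b) Let $\Upsilon_2$ be the $\mathcal{G}$-incomplete matrix

$$\Upsilon_2 = \begin{pmatrix} 1 & 1 & 2 & ? \\ 1 & 3 & ? & -2 \\ 2 & ? & 5 & 1 \\ ? & -2 & 1 & 4 \end{pmatrix}.$$

Applying the algorithm in Proposition 10.1 once more, we obtain

$$L = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 2 & 0 & 1 & 0 \\ 0 & -1 & 1 & 1 \end{pmatrix}, \quad
\Lambda = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \quad
\hat{\Upsilon}_2 = \begin{pmatrix} 1 & 1 & 2 & 0 \\ 1 & 3 & 2 & -2 \\ 2 & 2 & 5 & 1 \\ 0 & -2 & 1 & 4 \end{pmatrix}.$$

Proposition 10.1 guarantees that the above yields the unique positive definite completion
of $\Upsilon_2$ in $P_{\mathcal G}$.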
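The completion steps of Proposition 10.1 amount to a sparse $L \Lambda L^t$ factorization carried out column by column. The following is a minimal sketch, not the authors' implementation: the function name `complete_concentration` and the dictionary-based input format are hypothetical, and the update used for $\Lambda_{jj}$ and $L_{ij}$ is the standard $LDL^t$ recursion, assumed here to be the step intended by the proposition. The edge set $\{(2,1),(3,1),(4,2),(4,3)\}$ is inferred from the pattern of specified entries in Example 10.1.

```python
from fractions import Fraction


def complete_concentration(upsilon, edges):
    """Sparse L Lambda L^t completion of a G-incomplete symmetric matrix.

    upsilon: dict {(i, j): value}, 1-based, i >= j, holding the diagonal
             and the entries at edge positions.
    edges:   set of pairs (i, j) with i > j where L_ij may be non-zero.
    Returns (L, lam, completed) or None when a zero pivot is met.
    """
    m = max(i for i, _ in upsilon)
    L = [[Fraction(int(i == j)) for j in range(m)] for i in range(m)]
    lam = [Fraction(0)] * m
    for j in range(m):
        # pivot: Lambda_jj = Upsilon_jj - sum_k Lambda_kk L_jk^2
        lam[j] = Fraction(upsilon[(j + 1, j + 1)]) - sum(
            lam[k] * L[j][k] ** 2 for k in range(j))
        if lam[j] == 0:
            return None
        for i in range(j + 1, m):
            if (i + 1, j + 1) in edges:      # L_ij stays 0 off the edge set
                s = sum(lam[k] * L[i][k] * L[j][k] for k in range(j))
                L[i][j] = (Fraction(upsilon[(i + 1, j + 1)]) - s) / lam[j]
    completed = [[sum(L[i][k] * lam[k] * L[j][k] for k in range(m))
                  for j in range(m)] for i in range(m)]
    return L, lam, completed


edges = {(2, 1), (3, 1), (4, 2), (4, 3)}
ups1 = {(1, 1): 4, (2, 1): 8, (3, 1): 8, (2, 2): 19,
        (4, 2): 9, (3, 3): 18, (4, 3): 6, (4, 4): 44}
ups2 = {(1, 1): 1, (2, 1): 1, (3, 1): 2, (2, 2): 3,
        (4, 2): -2, (3, 3): 5, (4, 3): 1, (4, 4): 4}

L1, lam1, comp1 = complete_concentration(ups1, edges)
L2, lam2, comp2 = complete_concentration(ups2, edges)
print(lam1, all(x > 0 for x in lam1))   # a negative pivot: not in P_G
print(lam2, all(x > 0 for x in lam2))   # all pivots positive: in P_G
```

Exact fractions are used so that the sign test on the pivots is not affected by rounding; with floating point input a tolerance would be needed instead.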

The following corollary is an immediate consequence of Proposition 10.1.

Corollary 10.1. Let $R_{\mathcal G}$ denote the set of $\Upsilon \in I_{\mathcal G}$ that can be completed in $P_{\mathcal G}$. Then the
mapping $(\Omega \mapsto \Omega^{E}) : P_{\mathcal G} \to R_{\mathcal G}$ is a bijection with the inverse mapping $\Upsilon \mapsto \hat{\Upsilon}$.

In essence Corollary 10.1 identifies our concentration matrix space $P_{\mathcal G}$ with another
space $R_{\mathcal G}$ through a bijection. The space $R_{\mathcal G}$ is an open subset of Euclidean space $\mathbb{R}^{|E|}$ since
$R_{\mathcal G}$ is homeomorphic to $\Theta_{\mathcal G}$. We shall henceforth refer to $R_{\mathcal G}$ as the space of incomplete
concentration matrices over $\mathcal{G}$.

10.2 The Wishart distribution on $R_{\mathcal G}$

Recall that given a matrix $\Omega \in P_{\mathcal G}$, the $\mathcal{G}$-incomplete matrix $\Omega^{E}$ contains all the functionally
independent entries of $\Omega$, and, by Proposition 10.1, one can always recover the remaining
entries of $\Omega$ in polynomial time. Let $\pi^{R_{\mathcal G}}_{U,\alpha}$ denote the image of $\pi^{P_{\mathcal G}}_{U,\alpha}$ under the mapping
$\Omega \mapsto \Omega^{E}$. Since $R_{\mathcal G}$ is an open subset of Euclidean space $\mathbb{R}^{|E|}$ this distribution has a density
w.r.t. Lebesgue measure on $R_{\mathcal G}$. Hence $\pi^{R_{\mathcal G}}_{U,\alpha}$ can be considered as the DAG Wishart
distribution on $R_{\mathcal G}$ in both a natural and practical sense. We now proceed to state general results
corresponding to our DAG Wishart distributions on the space of incomplete concentration
matrices for ALL DAGs, and not just perfect DAGs as given in Section 8.1.

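Theorem 10.1. Let $\Omega \sim \pi^{P_{\mathcal G}}_{U,\alpha}$ and let $\Upsilon = \mathrm{proj}(\Omega) = \Omega^{E}$, then

1. The density of $\Upsilon \sim \pi^{R_{\mathcal G}}_{U,\alpha}$ w.r.t. the standard Lebesgue measure on $R_{\mathcal G}$ is given by

$$z_{\mathcal G}(U, \alpha)^{-1} \exp\{-\tfrac{1}{2}\mathrm{tr}(\hat{\Upsilon} U)\} \prod_{i=1}^{m} D_{ii}^{-\frac{1}{2}\alpha_i + pa_i + 2},$$

where $D_{ii} = (\hat{\Upsilon}^{-1})_{ii|\prec i \succ}$.

2. The Laplace transform of $\pi^{R_{\mathcal G}}_{U,\alpha}$ at $K$, where $K > 0$, is given by $\mathcal{L}_{R_{\mathcal G}}(K) = \dfrac{z_{\mathcal G}(2K + U, \alpha)}{z_{\mathcal G}(U, \alpha)}$.

3. $E[\Upsilon] = \mathrm{proj}\!\left( \sum_{j=1}^{m} (\alpha_j - pa_j - 2)\,\big((U_{\preceq j \succeq})^{-1}\big)^{0} - \sum_{j=1}^{m} (\alpha_j - pa_j - 3)\,\big((U_{\prec j \succ})^{-1}\big)^{0} \right)$.

Proof. The distribution $\pi^{R_{\mathcal G}}_{U,\alpha}$ is the image of $\pi^{\Theta_{\mathcal G}}_{U,\alpha}$ under the mapping $(D, L) \mapsto (LD^{-1}L^t)^{E}$.
Thus it suffices to compute the Jacobian of this mapping. One can readily check that the
Jacobian is equal to the Jacobian of $\psi$ in Equation (8.2). Calculations similar to those in Section
8.1 yield the Laplace transform and expected value of $\pi^{R_{\mathcal G}}_{U,\alpha}$. □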

11 The DAG inverse Wishart on the space of incomplete
covariance matrices

A natural question that follows from the last section is to determine the image of $\pi^{\Theta_{\mathcal G}}_{U,\alpha}$ on
the space of covariance matrices, denoted by $PD_{\mathcal G}$. This measure is the induced prior on
the space of covariance matrices that correspond to a Gaussian DAG model. In this section
we first identify a subset of the space of $\mathcal{G}$-incomplete matrices, called $S_{\mathcal G}$, that can be
identified with the space of covariance matrices $PD_{\mathcal G}$. Thereafter we shall define the DAG
inverse Wishart distribution on the space of incomplete covariance matrices $S_{\mathcal G}$.

11.1 The space of incomplete covariance matrices

Recall that $PD_{\mathcal G}$ is the set of positive definite matrices $\Sigma$ in $PD_m(\mathbb{R})$ such that $N_m(0, \Sigma)$
belongs to $\mathcal{N}(\mathcal{G})$. Equivalently, if $\Sigma$ is a positive definite matrix, then

$$\Sigma \in PD_{\mathcal G} \ \text{if and only if} \ \Sigma_{[i \nsucc} = \Sigma_{[i \succ}\, \Sigma_{\prec i \succ}^{-1}\, \Sigma_{\prec i \nsucc}, \ \text{for each } i \in V. \qquad (11.1)$$

Using this characterization allows us to identify $PD_{\mathcal G}$ with the functionally independent
elements of $\Sigma$. The following result is a key ingredient in this identification.

Proposition 11.1. Let $\Upsilon \in Q_{\mathcal G}$, then

1. There exists a completion process of polynomial complexity that can determine whether
$\Upsilon$ can be completed in $PD_{\mathcal G}$;

2. If a completion exists, this completion is unique and can be determined constructively
using the following process:

i) Set $\Sigma_{ij} = \Upsilon_{ij}$ for each $(i, j) \in E$ and set $j = m$.

ii) If $j > 1$, then set $j = j - 1$ and proceed to the next step, otherwise $\Sigma$ is successfully
completed.

iii) If $\Sigma_{\preceq j \succeq} > 0$, then proceed$^{11}$ to the next step, otherwise the completion in $PD_{\mathcal G}$
does not exist.

iv) If $\Sigma_{\prec j \nsucc}$ is non-empty, then set $\Sigma_{[j \nsucc} = \Sigma_{[j \succ}\, \Sigma_{\prec j \succ}^{-1}\, \Sigma_{\prec j \nsucc}$, $\Sigma_{\nprec j]} = \Sigma_{[j \nsucc}^{t}$ and return to step
ii).

Proof. The proof is found in (5J. □ 

Remark 11.1. Note that in Proposition 11.1, once more, the algorithm itself determines if $\Upsilon$
can be completed in $PD_{\mathcal G}$. It is clear from step iii) above that the necessary and sufficient
condition for the positive definite completion to exist is for the covariance sub-matrix of
the family of each node $j$ to be positive definite, i.e., $\Sigma_{\preceq j \succeq} > 0$, and not just $\Sigma_{\prec j \succ} > 0$.
Furthermore, unlike the process in Proposition 10.1, the process in Proposition 11.1 can terminate midway.
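The completion process of Proposition 11.1 can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' code: the names `complete_covariance`, `is_pd` and `solve` are hypothetical, parents of $j$ are taken to be the vertices $i > j$ with $(i,j) \in E$, and the set of positions filled in for vertex $j$ is assumed to be $\{j+1, \ldots, m\} \setminus pa(j)$. The inputs below are hypothetical test patterns, not data from the paper.

```python
from fractions import Fraction


def is_pd(a):
    """Symmetric matrix is PD iff all elimination pivots are positive."""
    a = [list(map(Fraction, row)) for row in a]
    for c in range(len(a)):
        if a[c][c] <= 0:
            return False
        for r in range(c + 1, len(a)):
            f = a[r][c] / a[c][c]
            a[r] = [vr - f * vc for vr, vc in zip(a[r], a[c])]
    return True


def solve(a, b):
    """Solve a x = b exactly by Gauss-Jordan elimination (a invertible)."""
    n = len(a)
    aug = [list(map(Fraction, row)) + [Fraction(v)] for row, v in zip(a, b)]
    for c in range(n):
        p = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n] for row in aug]


def complete_covariance(upsilon, m):
    """upsilon: dict {(i, j): value}, 1-based, i >= j (diagonal and edges).

    Returns the completed m x m matrix, or None if some family block
    fails to be positive definite.
    """
    sigma = {}
    for (i, j), v in upsilon.items():
        sigma[(i, j)] = sigma[(j, i)] = Fraction(v)
    for j in range(m, 0, -1):                # j = m only checks Sigma_mm > 0
        pa = [i for i in range(j + 1, m + 1) if (i, j) in upsilon]
        fam = [j] + pa
        if not is_pd([[sigma[(a, b)] for b in fam] for a in fam]):
            return None
        block = [[sigma[(a, b)] for b in pa] for a in pa]
        for k in range(j + 1, m + 1):
            if k not in pa:                  # unspecified position (j, k)
                x = solve(block, [sigma[(a, k)] for a in pa]) if pa else []
                val = sum((sigma[(j, a)] * xa for a, xa in zip(pa, x)),
                          Fraction(0))
                sigma[(j, k)] = sigma[(k, j)] = val
    return [[sigma[(i, j)] for j in range(1, m + 1)] for i in range(1, m + 1)]


# Immorality 2 -> 1 <- 3: the missing entry (3, 2) is completed with 0.
v_shape = complete_covariance(
    {(1, 1): 3, (2, 1): 1, (2, 2): 2, (3, 1): 1, (3, 3): 2}, 3)
# Chain 3 -> 2 -> 1: the missing entry (3, 1) is Sigma_12 Sigma_22^{-1} Sigma_23.
chain = complete_covariance(
    {(1, 1): 2, (2, 1): 1, (2, 2): 2, (3, 2): 1, (3, 3): 2}, 3)
# Four vertices with edges (2,1), (4,1), (3,2), (4,3) and entries +-0.9.
cycle = complete_covariance(
    {(1, 1): 1, (2, 1): "0.9", (4, 1): "-0.9", (2, 2): 1,
     (3, 2): "0.9", (3, 3): 1, (4, 3): "0.9", (4, 4): 1}, 4)
print(v_shape, chain, cycle)
```

The third input illustrates the early termination mentioned in Remark 11.1: the family block of vertex 1 fails the positive definiteness test, so the function returns `None` before every entry has been filled.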

Now define $S_{\mathcal G}$ as the set of all $\Upsilon \in Q_{\mathcal G}$ which can be completed in $PD_{\mathcal G}$. The next
corollary formalizes the fact that $S_{\mathcal G}$ can be identified with $PD_{\mathcal G}$.

Corollary 11.1. Let $\mathcal{G} = (V, E)$ be a DAG, then the mapping $(\Sigma \mapsto \Sigma^{E}) : PD_{\mathcal G} \to S_{\mathcal G}$
is a bijection with inverse mapping $\Upsilon \mapsto \Sigma(\Upsilon)$, where $\Sigma(\Upsilon)$ is constructed according to
Proposition 11.1.

Proof. The proof is immediate from Proposition 11.1 above. □

Remark 11.2. Note that when $\mathcal{G}$ is perfect $PD_{\mathcal G}$ is identical to $PD_{\mathcal{G}^u}$. Thus by the completion
result in Grone et al. [8], when $\mathcal{G}$ is perfect every partial positive definite matrix in $Q_{\mathcal G}$ can
be completed in $PD_{\mathcal G}$. Hence for $\mathcal{G}$ perfect $S_{\mathcal G}$ and $Q_{\mathcal G}$ are identical.

$^{11}$ Note that for each $j$, the submatrix $\Sigma_{\preceq j \succeq}$ is fully determined by step ii).

We now proceed to illustrate the result of the completion process on two examples.



Example 11.1. (a) Consider the DAG $\mathcal{G}$ given in Figure 3(a) and let $\Upsilon_1$ be the partial
positive $\mathcal{G}$-incomplete matrix given as follows:

$$\Upsilon_1 = \begin{pmatrix} 1 & 0.9 & ? & -0.9 \\ 0.9 & 1 & 0.9 & ? \\ ? & 0.9 & 1 & 0.9 \\ -0.9 & ? & 0.9 & 1 \end{pmatrix}.$$

Applying the completion process yields the following results: In step (iv) for $j = 2$ we
obtain $\Sigma_{42} = 0.9$. From this, in step (iii) for $j = 1$ we obtain

$$\begin{pmatrix} 1 & 0.9 & -0.9 \\ 0.9 & 1 & 0.9 \\ -0.9 & 0.9 & 1 \end{pmatrix},$$

which is not a positive definite matrix. Hence the completion process demonstrates that no
completion of $\Upsilon_1$ exists in $PD_{\mathcal G}$.





Figure 3: Completion in $PD_{\mathcal G}$, panels (a) and (b)



(b) Consider the DAG $\mathcal{G}$ as given in Figure 3(b) and let $\Upsilon$ be the $\mathcal{G}$-incomplete matrix
given by



( 3± 

J 3 
-1 
1 

3 
? 

? 

7 



-1 

6 

1 

-H 

? 

7 



3 
? 

22i 
-11 

? 
7 



7 

-11 

5^ 
-2 

-1 



? N 
? 

7 

-1 

7 

1 > 



Then by applying the completion process in Proposition 11.1 we obtain





-1 


1 

3 








> 


-1 


6 


11 


"5i 


2 


1 


1 

3 


11 


22 1 


-11 


4 


2 





-H 


-11 


5^ 


-2 


-1 





2 


4 


-2 


1
From the completion process it is easily verified that the matrix $\Sigma$ is a positive definite matrix.
Moreover, Proposition 11.1 guarantees that it is the unique completion of $\Upsilon$ in $PD_{\mathcal G}$.



11.2 The DAG inverse Wishart distribution on $S_{\mathcal G}$

Let $\pi^{S_{\mathcal G}}_{U,\alpha}$ denote the image of $\pi^{\Theta_{\mathcal G}}_{U,\alpha}$ under the mapping $(D, L) \mapsto (L^{-t} D L^{-1})^{E} : \Theta_{\mathcal G} \to S_{\mathcal G}$.
Similar to our interpretation of $\pi^{R_{\mathcal G}}_{U,\alpha}$ we will consider $\pi^{S_{\mathcal G}}_{U,\alpha}$ as the inverse Wishart distribution
for the DAG $\mathcal{G}$. Next we proceed to derive the density of this distribution w.r.t. Lebesgue
measure. To this end, recall the following notion from [20].

Definition 11.1. Let $T$ be a symmetric matrix in $\mathbb{R}^{V \times V}$, where as usual $V = \{1, \ldots, m\}$.
The Isserlis matrix of $T$, denoted by $\mathrm{Iss}(T)$, is the symmetric matrix indexed by the set
$\mathcal{W} = \{(i, j) : i, j \in V, i \ge j\}$ with entries

$$\mathrm{Iss}(T)_{ij,kl} = T_{ik} T_{jl} + T_{il} T_{jk}, \quad (i, j), (k, l) \in \mathcal{W}.$$

We also note the following properties of $\mathrm{Iss}(T)$:
(i) $\det(\mathrm{Iss}(T)) = 2^{m} \det(T)^{m+1}$,
(ii) $\mathrm{Iss}(T)$ is invertible if and only if $T$ is invertible.
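Definition 11.1 and property (i) are easy to check on small matrices. The sketch below is illustrative only; the function names `isserlis` and `det` are hypothetical, and exact rational arithmetic is used so that the determinant identity can be tested as an equality.

```python
from fractions import Fraction


def det(a):
    """Determinant by exact Gaussian elimination on a square matrix."""
    a = [list(map(Fraction, row)) for row in a]
    n, d = len(a), Fraction(1)
    for c in range(n):
        p = next((r for r in range(c, n) if a[r][c] != 0), None)
        if p is None:
            return Fraction(0)
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            a[r] = [vr - f * vc for vr, vc in zip(a[r], a[c])]
    return d


def isserlis(T):
    """Return (W, Iss(T)) with W = {(i, j): i >= j} in lexicographic order."""
    m = len(T)
    W = [(i, j) for i in range(m) for j in range(i + 1)]
    iss = [[T[i][k] * T[j][l] + T[i][l] * T[j][k] for (k, l) in W]
           for (i, j) in W]
    return W, iss


T2 = [[2, 1], [1, 3]]
W2, iss2 = isserlis(T2)
print(iss2)                          # 3 x 3 matrix indexed by (1,1),(2,1),(2,2)
print(det(iss2), 2 ** 2 * det(T2) ** 3)

T3 = [[3, 1, 1], [1, 2, 0], [1, 0, 2]]
W3, iss3 = isserlis(T3)
print(det(iss3) == 2 ** 3 * det(T3) ** 4)
```

The index set is 0-based in the code, so the pair `(1, 0)` corresponds to the position $(2,1)$ in the notation of the definition.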

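Now for a subset $\mathcal{U}$ of $\mathcal{W}$ define $T^{\mathcal{U}}$ as the incomplete symmetric matrix the entries of
which are specified as $T^{\mathcal{U}}_{ij} = T^{\mathcal{U}}_{ji} = T_{ij}$ for each $(i, j) \in \mathcal{U}$. If in addition, $T$ is invertible,
then we shall denote $(T^{-1})^{\mathcal{U}}$ by $T^{-\mathcal{U}}$. We caution the reader that the notation $T^{-\mathcal{U}}$ differs
from $T_{a}^{-1}$ where $a \subset V$. In the former, $\mathcal{U} \subset \mathcal{W} \subset V \times V$, whereas the latter refers to $(T^{-1})_{a}$,
i.e., $a \subset V$. With this notation in hand we now proceed to state the functional form of
the DAG inverse Wishart distribution $\pi^{S_{\mathcal G}}_{U,\alpha}$.

Proposition 11.2. Let $\mathcal{G} = (V, E)$ be an arbitrary DAG and let $\Sigma \sim \pi^{PD_{\mathcal G}}_{U,\alpha}$. Now let $\Upsilon =
\mathrm{proj}(\Sigma) = \Sigma^{E}$, then the density of $\Upsilon \sim \pi^{S_{\mathcal G}}_{U,\alpha}$ w.r.t. Lebesgue measure is given by

$$\pi^{S_{\mathcal G}}_{U,\alpha}(\Upsilon) = 2^{m} z_{\mathcal G}(U, \alpha)^{-1} \exp\{-\tfrac{1}{2}\mathrm{tr}(\Sigma(\Upsilon)^{-1} U)\} \prod_{i=1}^{m} D_{ii}^{-\frac{1}{2}\alpha_i + pa_i + 2} \det\!\big(\mathrm{Iss}(\Sigma(\Upsilon))_{E|\mathcal{J}}\big)^{-1}, \qquad (11.2)$$

where $\mathcal{J} = \mathcal{V} \setminus E$, and where $\mathcal{V}$ is the edge set of $\mathcal{G}^m$, with the convention that if $(i, j) \in \mathcal{V}$,
then $i \ge j$, and $D_{ii} = \Sigma(\Upsilon)_{ii|\prec i \succ}$.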

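Proof. First note that the functional form of $\pi^{S_{\mathcal G}}_{U,\alpha}$ can be obtained as the image of $\pi^{R_{\mathcal G}}_{U,\alpha}$ under
the mapping $\Upsilon \mapsto \mathrm{proj}(\hat{\Upsilon}^{-1}) : R_{\mathcal G} \to S_{\mathcal G}$. Let us denote the inverse of this mapping
symbolically by $\Sigma^{E} \mapsto \Sigma^{-E}$. We now proceed to evaluate the Jacobian of this mapping.
Following the notation in [20] let $M^{+}(\mathcal{G}^m)$ denote the set of $\Sigma^{\mathcal{V}}$ such that $\Sigma \in PD_{\mathcal{G}^m}$, and
let $M_{+}(\mathcal{G}^m)$ denote the set of $\Sigma^{-\mathcal{V}}$ such that $\Sigma \in PD_{\mathcal{G}^m}$. By [20, Equation (11)] the
derivative of the mapping $\Sigma^{\mathcal{V}} \mapsto \Sigma^{-\mathcal{V}}$ is given by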



dZ 



-v 



= -Iss(X-% ll7 Iss(I) rir , (11.3)
where $\bar{\mathcal{V}} = \mathcal{W} \setminus \mathcal{V}$. Since any distribution that obeys the directed Markov property w.r.t.
$\mathcal{G}$ also obeys the Markov property w.r.t. $\mathcal{G}^m$ (see Lemma 2.1), $\Sigma \in PD_{\mathcal G}$ implies that
$\Sigma \in PD_{\mathcal{G}^m}$. Therefore, the mapping $\Sigma^{E} \mapsto \Sigma^{-E}$ is the restriction of the mapping $\Sigma^{\mathcal{V}} \mapsto \Sigma^{-\mathcal{V}}$
to $S_{\mathcal G}$. Thus by using Equation (11.3) above we obtain



= lss{YT%^Iss{I) E ^ 
= lss(Z- l ) Ell? I SS (I) E , 

where the last two steps follow from the fact that $\mathrm{Iss}(I)$ is a diagonal matrix and that
$\mathrm{Iss}(I)_{E|\mathcal{J}} = \mathrm{Iss}(I)_{E}$. By using Equation (2.1) in [?] and Equation (9) in [20], respectively,
we can write

Iss(2T V = (issCT 1 )^)" 1 

= (issilflss&h^Issilfy 1 . 

Hence the Jacobian of the mapping $\Sigma^{E} \mapsto \Sigma^{-E}$ is equal to $2^{m} \det(\mathrm{Iss}(\Sigma)_{E|\mathcal{J}})^{-1}$. The functional
form of $\pi^{S_{\mathcal G}}_{U,\alpha}(\Upsilon)$ now follows from a change of measure calculation. □

Remark 11.3. Note that for calculating the Jacobian term in the density $\pi^{S_{\mathcal G}}_{U,\alpha}(\Upsilon)$ above, one
only needs to evaluate $\Sigma(\Upsilon)^{\mathcal{V}}$, that is, the completion of $\Upsilon$ in $PD_{\mathcal G}$, restricted to the entries
that correspond to $\mathcal{G}^m$.

We now proceed to illustrate the proposition above on a concrete example.




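Figure 4: Wishart density over $S_{\mathcal G}$.

Example 11.2. Consider the DAG $\mathcal{G}$ given in Figure 4. Let us now apply the results
of Proposition 11.2 to derive the density of $\Upsilon \sim \pi^{S_{\mathcal G}}_{U,\alpha}$. First we need to find $\Sigma(\Upsilon)^{\mathcal{V}}$, the
completion of $\Upsilon$ in $PD_{\mathcal G}$ restricted to the entries that correspond to $\mathcal{G}^m$. To this end,
we only need to determine $\Sigma_{32}$ (since all other entries of $\Sigma$ are already specified).
Equation (11.1) and the fact that $pa(3) = \emptyset$ imply that $\Sigma_{32} = 0$. The
next step is to compute $\mathrm{Iss}(\Sigma)$. Note that $\mathrm{Iss}(\Sigma)$ is a $6 \times 6$ matrix in $\mathbb{R}^{\mathcal{W} \times \mathcal{W}}$, where $\mathcal{W} =
\{(1, 1), (2, 1), (2, 2), (3, 1), (3, 2), (3, 3)\}$. Recall that

$$\mathrm{Iss}(\Sigma)_{ij,kl} = \Sigma_{ik} \Sigma_{jl} + \Sigma_{il} \Sigma_{jk}, \quad (i, j), (k, l) \in \mathcal{W},$$

and hence every entry of $\mathrm{Iss}(\Sigma)$ is an explicit quadratic function of the entries of $\Sigma$.
Now for $E = \{(1, 1), (2, 1), (2, 2), (3, 1), (3, 3)\}$ and $\mathcal{J} = \{(3, 2)\}$, note that $\mathrm{Iss}(\Sigma)_{E|\mathcal{J}}$ is the Schur complement
$\mathrm{Iss}(\Sigma)_{EE} - \mathrm{Iss}(\Sigma)_{E\mathcal{J}}\, \mathrm{Iss}(\Sigma)_{\mathcal{J}\mathcal{J}}^{-1}\, \mathrm{Iss}(\Sigma)_{\mathcal{J}E}$.
From this we can compute $\det(\mathrm{Iss}(\Sigma)_{E|\mathcal{J}})$ explicitly. After some simplification the final
expression is given as follows:

$$\det(\mathrm{Iss}(\Sigma)_{E|\mathcal{J}}) = \frac{8 \left( \Sigma_{33} \Sigma_{21}^{2} + \Sigma_{22} \Sigma_{31}^{2} - \Sigma_{11} \Sigma_{22} \Sigma_{33} \right)^{4}}{\Sigma_{22} \Sigma_{33}}.$$

The Jacobian calculation above allows us to specify the functional form of the density
$\pi^{S_{\mathcal G}}_{U,\alpha}(\Upsilon)$ w.r.t. Lebesgue measure:

$$\pi^{S_{\mathcal G}}_{U,\alpha}(\Upsilon) = z_{\mathcal G}(U, \alpha)^{-1} \exp\{-\tfrac{1}{2}\mathrm{tr}(\Sigma(\Upsilon)^{-1} U)\}\, D_{11}^{-\frac{1}{2}\alpha_1 + 4} D_{22}^{-\frac{1}{2}\alpha_2 + 2} D_{33}^{-\frac{1}{2}\alpha_3 + 2} \times \frac{\Sigma_{22} \Sigma_{33}}{\left( \Sigma_{33} \Sigma_{21}^{2} + \Sigma_{22} \Sigma_{31}^{2} - \Sigma_{11} \Sigma_{22} \Sigma_{33} \right)^{4}}$$

$$= z_{\mathcal G}(U, \alpha)^{-1} \exp\{-\tfrac{1}{2}\mathrm{tr}(\Sigma(\Upsilon)^{-1} U)\}\, D_{11}^{-\frac{1}{2}\alpha_1 + 4} D_{22}^{-\frac{1}{2}\alpha_2 + 2} D_{33}^{-\frac{1}{2}\alpha_3 + 2} \times \frac{\Sigma_{22} \Sigma_{33}}{\left( \Sigma_{11|\prec 1 \succ} \right)^{4} \left( \Sigma_{22} \Sigma_{33} \right)^{4}}. \qquad (11.4)$$

The Isserlis matrix expressions in Proposition 11.2 provide a useful tool for computing
the Jacobian of the mapping $(\Sigma^{E} \mapsto \Sigma^{-E}) : S_{\mathcal G} \to R_{\mathcal G}$. Nevertheless, Example 11.2 clearly
demonstrates the complexity of the lengthy computations involved, even for the simplest
of DAGs. A closer examination of Equation (11.4) suggests that the final expression for
$\pi^{S_{\mathcal G}}_{U,\alpha}(\Upsilon)$ may be simplified in terms of the local properties of the DAG $\mathcal{G}$. To show that this
is indeed the case we first prove the following lemma.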

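Lemma 11.1. Let $\mathcal{G} = (V, E)$ be an arbitrary DAG, then the Jacobian of the mapping
$(\Sigma^{-E} \mapsto \Sigma^{E}) : R_{\mathcal G} \to S_{\mathcal G}$ is given as follows:

$$\prod_{i=1}^{m} \Sigma_{ii|\prec i \succ}^{\,pa_i + 2} \det(\Sigma_{\prec i \succ}).$$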



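Proof. First note that the mapping $\Sigma^{-E} \mapsto \Sigma^{E}$ can be written as the composition of the
two mappings $\big(\Sigma^{-E} \mapsto \times_{i=1}^{m} (\Sigma_{ii|\prec i \succ}, \Sigma_{\prec i \succ}^{-1} \Sigma_{\prec i]})\big) : R_{\mathcal G} \to \Xi_{\mathcal G}$ and $\big(\times_{i=1}^{m} (\Sigma_{ii|\prec i \succ}, \Sigma_{\prec i \succ}^{-1} \Sigma_{\prec i]}) \mapsto
\Sigma^{E}\big) : \Xi_{\mathcal G} \to S_{\mathcal G}$. It is easy to check that the Jacobian of the first mapping is the same as the
Jacobian of the inverse of the mapping $\psi : (L, D) \mapsto LD^{-1}L^{t}$ in Lemma 8.1 and is therefore
equal to $\prod_{i=1}^{m} \Sigma_{ii|\prec i \succ}^{\,pa_i + 2}$. So it remains to calculate the Jacobian of the second mapping.

We shall proceed by mathematical induction. Let us assume by the inductive hypothesis
that the Jacobian of the mapping $\big(\times_{i=1}^{m} (\Sigma_{ii|\prec i \succ}, \Sigma_{\prec i \succ}^{-1} \Sigma_{\prec i]}) \mapsto \Sigma^{E}\big) : \Xi_{\mathcal G} \to S_{\mathcal G}$ is equal to
$\prod_{i} \det(\Sigma_{\prec i \succ})$ for any DAG $\mathcal{G}$ with $|V| < m$. We will show that the result also holds
for $|V| = m$. The case $m = 1$ is trivial. So assume that $m \ge 2$. Let $\mathcal{G}_{[1]}$ be the induced
subgraph of $\mathcal{G}$ with the vertex set $V_{[1]} := V \setminus \{1\}$ and the corresponding edge set, denoted
by $E_{[1]}$. Since $V_{[1]}$ is an ancestral subset of $V$, if $\Sigma^{E}$ belongs to $S_{\mathcal G}$, then $\Sigma^{E_{[1]}}$, the projection
of $\Sigma^{E}$ on $I_{\mathcal{G}_{[1]}}$, is an element of $S_{\mathcal{G}_{[1]}}$. Furthermore the positive definite completion of $\Sigma^{E_{[1]}}$ in
$PD_{\mathcal{G}_{[1]}}$ is indeed the principal sub-matrix $\Sigma_{V_{[1]}}$. The above two observations follow
from the recursive nature of the completion process in Proposition 11.1. Now consider the
following decomposition of the inverse mapping $\Sigma^{E} \mapsto \times_{i=1}^{m} (\Sigma_{ii|\prec i \succ}, \Sigma_{\prec i \succ}^{-1} \Sigma_{\prec i]})$:

$$S_{\mathcal G} \to \mathbb{R}_{+} \times \mathbb{R}^{\prec 1]} \times S_{\mathcal{G}_{[1]}} \to \mathbb{R}_{+} \times \mathbb{R}^{\prec 1]} \times \Xi_{\mathcal{G}_{[1]}} = \Xi_{\mathcal G},$$

$$\Sigma^{E} \mapsto \big(\Sigma_{11|\prec 1 \succ},\ \Sigma_{\prec 1 \succ}^{-1} \Sigma_{\prec 1]},\ \Sigma^{E_{[1]}}\big) \mapsto \big(\Sigma_{11|\prec 1 \succ},\ \Sigma_{\prec 1 \succ}^{-1} \Sigma_{\prec 1]},\ \times_{i=2}^{m} (\Sigma_{ii|\prec i \succ}, \Sigma_{\prec i \succ}^{-1} \Sigma_{\prec i]})\big).$$

By the inductive hypothesis the Jacobian of the second mapping,

$$\big(\Sigma_{11|\prec 1 \succ},\ \Sigma_{\prec 1 \succ}^{-1} \Sigma_{\prec 1]},\ \Sigma^{E_{[1]}}\big) \mapsto \big(\Sigma_{11|\prec 1 \succ},\ \Sigma_{\prec 1 \succ}^{-1} \Sigma_{\prec 1]},\ \times_{i=2}^{m} (\Sigma_{ii|\prec i \succ}, \Sigma_{\prec i \succ}^{-1} \Sigma_{\prec i]})\big),$$

is equal to $\prod_{i=2}^{m} \det(\Sigma_{\prec i \succ})^{-1}$. Hence it suffices to prove that the Jacobian of the first mapping,

$$\Sigma^{E} = \big(\Sigma_{11},\ \Sigma_{\prec 1]},\ \Sigma^{E_{[1]}}\big) \mapsto \big(\Sigma_{11|\prec 1 \succ},\ \Sigma_{\prec 1 \succ}^{-1} \Sigma_{\prec 1]},\ \Sigma^{E_{[1]}}\big) = \big(\Sigma_{11} - \Sigma_{[1 \succ} \Sigma_{\prec 1 \succ}^{-1} \Sigma_{\prec 1]},\ \Sigma_{\prec 1 \succ}^{-1} \Sigma_{\prec 1]},\ \Sigma^{E_{[1]}}\big),$$

is $\det(\Sigma_{\prec 1 \succ})^{-1}$. This follows by noting that the Jacobian matrix of this mapping is lower
triangular and is given as follows:

$$\begin{pmatrix} 1 & 0 & 0 \\ * & \Sigma_{\prec 1 \succ}^{-1} & 0 \\ * & * & I \end{pmatrix}.$$

Hence the result now follows by induction. □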

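We now proceed to state the functional form of the density of $\pi^{S_{\mathcal G}}_{U,\alpha}$ w.r.t. Lebesgue
measure without using Isserlis matrices.

Corollary 11.2. Let $\mathcal{G} = (V, E)$ be an arbitrary DAG and let $\Sigma \sim \pi^{PD_{\mathcal G}}_{U,\alpha}$. Now let $\Upsilon =
\mathrm{proj}(\Sigma) = \Sigma^{E}$, then the density of $\Upsilon \sim \pi^{S_{\mathcal G}}_{U,\alpha}$ w.r.t. Lebesgue measure is given by

$$z_{\mathcal G}(U, \alpha)^{-1} \exp\{-\tfrac{1}{2}\mathrm{tr}(\Sigma(\Upsilon)^{-1} U)\} \prod_{i=1}^{m} \Sigma_{ii|\prec i \succ}^{-\frac{1}{2}\alpha_i} \det(\Sigma_{\prec i \succ})^{-1}. \qquad (11.5)$$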
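As a worked consistency check (a sketch added for illustration, assuming the DAG $2 \to 1 \leftarrow 3$ of Example 11.2, so that $pa_1 = 2$, $pa_2 = pa_3 = 0$ and $\Sigma_{32} = 0$), the Isserlis-based density of Proposition 11.2 reduces to the form in Corollary 11.2. The only facts used are $D_{ii} = \Sigma_{ii|\prec i \succ}$, $D_{22} = \Sigma_{22}$, $D_{33} = \Sigma_{33}$ and $\det(\Sigma) = D_{11} D_{22} D_{33}$.

```latex
\begin{align*}
\det\big(\mathrm{Iss}(\Sigma)_{E|\mathcal{J}}\big)
  &= \frac{\det(\mathrm{Iss}(\Sigma))}{\mathrm{Iss}(\Sigma)_{32,32}}
   = \frac{2^{3}\det(\Sigma)^{4}}{\Sigma_{22}\Sigma_{33} + \Sigma_{32}^{2}}
   = \frac{8\,(D_{11}D_{22}D_{33})^{4}}{\Sigma_{22}\Sigma_{33}}, \\[4pt]
2^{3}\prod_{i=1}^{3} D_{ii}^{-\frac12\alpha_i + pa_i + 2}\,
  \det\big(\mathrm{Iss}(\Sigma)_{E|\mathcal{J}}\big)^{-1}
  &= D_{11}^{-\frac12\alpha_1 + 4} D_{22}^{-\frac12\alpha_2 + 2} D_{33}^{-\frac12\alpha_3 + 2}
     \cdot \frac{\Sigma_{22}\Sigma_{33}}{(D_{11}D_{22}D_{33})^{4}} \\
  &= D_{11}^{-\frac12\alpha_1} D_{22}^{-\frac12\alpha_2} D_{33}^{-\frac12\alpha_3}
     \cdot \frac{1}{\Sigma_{22}\Sigma_{33}} \\
  &= \prod_{i=1}^{3} \Sigma_{ii|\prec i\succ}^{-\frac12\alpha_i}\,
     \det(\Sigma_{\prec i\succ})^{-1},
\end{align*}
```

where the last step uses $\det(\Sigma_{\prec 1 \succ}) = \Sigma_{22}\Sigma_{33}$ and the convention $\det(\Sigma_{\prec i \succ}) = 1$ when $pa(i) = \emptyset$.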

Remark 11.4. In Remark 11.2 we established that when $\mathcal{G}$ is perfect $S_{\mathcal G}$ and $Q_{\mathcal G}$ are identical.
Hence for $\mathcal{G}$ perfect $\pi^{S_{\mathcal G}}_{U,\alpha}$ and $\pi^{Q_{\mathcal G}}_{U,\alpha}$ are the same distribution.

We conclude this subsection by making the observation that the expression in Equation
(11.5) is much simpler to evaluate than the expression in Equation (11.2).

11.3 The inverse DAG Wishart for homogeneous DAGs

In this section we proceed to formally demonstrate that the class of inverse DAG Wisharts
naturally contains an important sub-class of inverse Wishart distributions that was
introduced by Khare and Rajaratnam [9] in the context of Gaussian covariance graph models.
In the process we also demonstrate that for a special class of DAGs the functional form
of the density of the DAG Wisharts $\pi^{S_{\mathcal G}}_{U,\alpha}$ can be considerably simplified. Recall that a Gaussian
covariance graph model over an undirected graph $G = (V, \mathcal{V})$, denoted by $\mathcal{N}(G^{cov})$, is
defined as follows.

Definition 11.2. Let $PD_{G^{cov}}$ denote the set of positive definite matrices $\Sigma \in PD_m(\mathbb{R})$ such
that $\Sigma_{ij} = 0$ whenever $i \nsim_G j$, i.e., when $i$ and $j$ are not neighbors. Then the Gaussian
covariance graph model over $G$ is defined as

$$\mathcal{N}(G^{cov}) := \{ N_m(0, \Sigma) : \Sigma \in PD_{G^{cov}} \}.$$

A formal comparison between the DAG Wishart priors introduced in this paper and the
covariance Wishart priors introduced in [9] requires a few technical definitions.

Definition 11.3. a) A DAG $\mathcal{G}$ is called a homogeneous DAG of type I if it is transitive
(i.e., $i \to j \to k$ implies that $i \to k$), and perfect. A DAG $\mathcal{G}$ is called a homogeneous
DAG of type II if it is transitive and does not contain any induced subgraph of the
form $j \leftarrow i \to k$.

b) An undirected graph $G = (V, \mathcal{V})$ is called homogeneous if for each pair of vertices
$i, j \in V$,

$$i \sim_G j \ \Rightarrow\ ne(i) \cup \{i\} \subseteq ne(j) \cup \{j\} \ \text{ or } \ ne(j) \cup \{j\} \subseteq ne(i) \cup \{i\}. \qquad (11.6)$$

Equivalently, a graph $G$ is said to be homogeneous if it is decomposable and does not
contain the $A_4$ path as an induced subgraph. The reader is referred to [15] for further details
on homogeneous graphs.

Note that if $\mathcal{G}$ is a homogeneous DAG of either type, then $\mathcal{G}^u$ is homogeneous. On
the other hand, if $G = (V, \mathcal{V})$ is homogeneous, then one can construct a homogeneous
DAG of type I or II that is a DAG version of $G$. This can be achieved by using the Hasse
tree associated with the homogeneous (undirected) graph and using the given orientation to
obtain a DAG of type I. Reversing the orientation (i.e., redirecting all the arrows to the root
of the tree) will yield a DAG of type II. More precisely, we now describe a construction of
a DAG version that is homogeneous of type II. Let $\mathcal{G}$ be a directed version of $G$
obtained by directing each edge $i \sim_G j$ to a directed edge $i \to j$ if $ne(i) \cup \{i\} \subset ne(j) \cup \{j\}$,
or $j \to i$ if $ne(j) \cup \{j\} \subset ne(i) \cup \{i\}$. If $ne(i) \cup \{i\} = ne(j) \cup \{j\}$, an arbitrary direction is
chosen. From Equation (11.6) one can check that $\mathcal{G}$ is a transitive DAG and that it does not
contain any induced subgraph of the form $j \leftarrow i \to k$. In general, it can be shown that if $\mathcal{G}$
is a homogeneous DAG of type II and a DAG version of $G$, then $\mathcal{N}(\mathcal{G})$ is identical to the
Gaussian covariance model $\mathcal{N}(G^{cov})$ in the sense that $PD_{G^{cov}} = PD_{\mathcal G}$ (see [18] for instance
for more details). It is also evident, from the Markov equivalence of perfect DAGs and
decomposable graphs, that for a homogeneous DAG $\mathcal{G}$ of type I which is a DAG version of
$G$, we have $PD_{G} = PD_{\mathcal G}$.
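The orientation rule described above (direct $i \to j$ when the closed neighbourhood of $i$ is contained in that of $j$) can be sketched as follows. This is an illustration only: the function name `orient_type2` is hypothetical, ties between equal closed neighbourhoods are broken by vertex label (one arbitrary choice among many), and no attempt is made to relabel vertices to match the paper's convention that parents carry larger labels than their children.

```python
def orient_type2(adj):
    """Orient a homogeneous undirected graph by closed-neighbourhood inclusion.

    adj: dict mapping each vertex to the set of its neighbours.
    Returns a set of arrows (i, j), meaning i -> j.
    Raises ValueError if some adjacent pair violates condition (11.6).
    """
    closed = {v: set(adj[v]) | {v} for v in adj}
    arrows = set()
    for i in adj:
        for j in adj[i]:
            if i < j:                         # visit each undirected edge once
                if closed[i] <= closed[j]:    # includes ties: pick i -> j
                    arrows.add((i, j))
                elif closed[j] < closed[i]:
                    arrows.add((j, i))
                else:
                    raise ValueError(
                        "closed neighbourhoods of %r and %r are not nested"
                        % (i, j))
    return arrows


# Path 1 - 2 - 3: both end vertices point into the middle vertex.
path3 = orient_type2({1: {2}, 2: {1, 3}, 3: {2}})
print(sorted(path3))

# Path 1 - 2 - 3 - 4 (the A_4 path) is not homogeneous, so the rule fails.
try:
    orient_type2({1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}})
    a4_rejected = False
except ValueError:
    a4_rejected = True
print(a4_rejected)
```

For the three-vertex path the result $1 \to 2 \leftarrow 3$ is transitive and contains no induced subgraph of the form $j \leftarrow i \to k$, in line with the definition of a homogeneous DAG of type II.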
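Proposition 11.3. Let $\mathcal{G} = (V, E)$ be a homogeneous DAG of either type I or II and let
$G = (V, \mathcal{V})$ be a homogeneous graph.

a) The density of $\pi^{S_{\mathcal G}}_{U,\alpha}$ is given by

$$z_{\mathcal G}(U, \alpha)^{-1} \exp\{-\tfrac{1}{2}\mathrm{tr}(\Sigma(\Upsilon)^{-1} U)\} \prod_{i=1}^{m} \Sigma_{ii|\prec i \succ}^{-\frac{1}{2}(\alpha_i + 2 ch_i(\mathcal{G}))}, \qquad (11.7)$$

where $ch_i(\mathcal{G}) = |ch_{\mathcal G}(i)|$.

b) If $\mathcal{G}$ is of type II and a DAG version of $G$, then the open cone $PD_{G^{cov}}$ can be identified
with $S_{\mathcal G}$ via the bijective mapping

$$\Upsilon \mapsto [\Upsilon] := \Sigma(\Upsilon) : S_{\mathcal G} \to PD_{G^{cov}}. \qquad (11.8)$$

Let $\pi^{PD_{G^{cov}}}_{U,\alpha}$ denote the probability image of the inverse DAG Wishart $\pi^{S_{\mathcal G}}_{U,\alpha}$ under the
mapping in Equation (11.8). Then the density of $\pi^{PD_{G^{cov}}}_{U,\alpha}$ w.r.t. Lebesgue measure is
given by Equation (11.7).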

Proof, a) In light of Equation (|1 1.5b in Corollary II 1.21 it suffices to prove that for every 
2 G PD^, 

HdetCS^n^f. (H.9) 

ieV ieV 

1 . Suppose that Q is homogeneous of type I. We shall first show that for every / 6 V 

det(S <iV )= f] Z m<f> . (11.10) 

tepa(i) 

If pa(7) = for some i, then by our convention det(£ <( >) = 1 and ^u\<e> = 1 for any £ G pa(?) 
and therefore Equation (|1 1.10b holds. Now let £ Q be the smallest integer in pa(z'). One then 
can easily check that since Q is both transitive and perfect we have pa(z') = {£ Q } U pa(^ )- 
From this we write 

det(2 <fo> ). 

Now by repeating this procedure we obtain the result in Equation dl 1.10b . Finally we write 

ndet ( ^ ) =nn s ^=n s ^ ) - 

ieV i'eV fepa(i) ieV 

2. Suppose Q is homogeneous of type II. We shall proceed by induction. It is clear that 
Equation (111.9b holds when m = \V\ = 1. Now by the inductive hypothesis assume that 
Equation (II 1.9b holds for every homogeneous DAG of type II, connected or disconnected, 
with fewer vertices than m = |V|. Using the inductive hypothesis we shall show that Equa- 
tion (|11.9b will also hold for Q with m vertices. Now let S G PD^ be given. 
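Case 1) Suppose that $\mathcal{G}$ is connected. Let $\mathcal{D}$ be the induced DAG on $V \setminus \{1\}$. It is clear that $\mathcal{D}$ is a homogeneous DAG of type II and therefore by the induction hypothesis

\[
\prod_{i=2}^{m} \det(\Psi_{\prec i\succ}) = \prod_{i=2}^{m} \Psi_{ii|\prec i\succ}^{\,ch_i(\mathcal{D})},
\]

where $\Psi = \Sigma_{V\setminus\{1\}}$. Note that $\mathcal{D}$ is an ancestral subgraph of $\mathcal{G}$ and hence $\mathrm{fa}_{\mathcal{D}}(i) = \mathrm{fa}_{\mathcal{G}}(i)$ for each $i = 2, \ldots, m$, and consequently $\Psi_{\prec i\succ} = \Sigma_{\prec i\succ}$ and $\Psi_{ii|\prec i\succ} = \Sigma_{ii|\prec i\succ}$. All together these imply the following:

\[
\prod_{i=2}^{m} \det(\Sigma_{\prec i\succ}) = \prod_{i=2}^{m} \Sigma_{ii|\prec i\succ}^{\,ch_i(\mathcal{D})}.
\]

Now we claim that $\mathrm{fa}_{\mathcal{G}}(1) = V$. Assume to the contrary that $V \setminus \mathrm{fa}_{\mathcal{G}}(1) \neq \emptyset$. Since $\mathcal{G}$ is connected, this implies that there exist vertices $i \in \mathrm{fa}_{\mathcal{G}}(1)$ and $j \in V \setminus \mathrm{fa}_{\mathcal{G}}(1)$ such that $i, j$ are adjacent in $\mathcal{G}$. But this implies $j \to i \to 1$ or $j \leftarrow i \to 1$. By definition these induced subgraphs cannot occur in $\mathcal{G}$. Thus $\mathrm{fa}_{\mathcal{G}}(1) = V$ and therefore we have

\[
\det(\Sigma_{\prec 1\succ}) = \Sigma_{11|\prec 1\succ}^{-1} \det(\Sigma) = \prod_{i=2}^{m} \Sigma_{ii|\prec i\succ}.
\]

Also the fact that $\mathrm{fa}_{\mathcal{G}}(1) = V$ implies that for each $i \in V \setminus \{1\}$ we have

\[
ch_i(\mathcal{G}) = ch_i(\mathcal{D}) + 1.
\]

Therefore

\[
\begin{aligned}
\prod_{i\in V} \det(\Sigma_{\prec i\succ}) &= \det(\Sigma_{\prec 1\succ}) \prod_{i=2}^{m} \det(\Sigma_{\prec i\succ}) \\
&= \prod_{i=2}^{m} \Sigma_{ii|\prec i\succ} \prod_{i=2}^{m} \Sigma_{ii|\prec i\succ}^{\,ch_i(\mathcal{D})} \\
&= \prod_{i\in V} \Sigma_{ii|\prec i\succ}^{\,ch_i(\mathcal{G})}.
\end{aligned}
\]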


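Case 2) Suppose $\mathcal{G}$ is disconnected. Let $\mathcal{D}_1$ and $\mathcal{D}_2$ denote respectively the induced subgraphs of $\mathcal{G}$ on $\mathrm{fa}_{\mathcal{G}}(1)$ and $V \setminus \mathrm{fa}_{\mathcal{G}}(1)$. It is clear that $\mathcal{D}_1$ and $\mathcal{D}_2$ are both homogeneous of type II. In addition it is also easily verified that they are ancestral. Now let $\Psi := \Sigma_{\mathrm{fa}_{\mathcal{G}}(1)} \in \mathrm{PD}_{\mathcal{D}_1}$ and $\Phi := \Sigma_{V\setminus \mathrm{fa}_{\mathcal{G}}(1)} \in \mathrm{PD}_{\mathcal{D}_2}$. Applying the induction hypothesis and the fact that $\mathcal{D}_1$ and $\mathcal{D}_2$ are disjoint we have:

\[
\begin{aligned}
\prod_{i\in V} \det(\Sigma_{\prec i\succ}) &= \prod_{i\in \mathrm{fa}_{\mathcal{G}}(1)} \det(\Sigma_{\prec i\succ}) \prod_{i\in V\setminus \mathrm{fa}_{\mathcal{G}}(1)} \det(\Sigma_{\prec i\succ}) \\
&= \prod_{i\in \mathrm{fa}_{\mathcal{G}}(1)} \det(\Psi_{\prec i\succ}) \prod_{i\in V\setminus \mathrm{fa}_{\mathcal{G}}(1)} \det(\Phi_{\prec i\succ}) \\
&= \prod_{i\in \mathrm{fa}_{\mathcal{G}}(1)} \Psi_{ii|\prec i\succ}^{\,ch_i(\mathcal{D}_1)} \prod_{i\in V\setminus \mathrm{fa}_{\mathcal{G}}(1)} \Phi_{ii|\prec i\succ}^{\,ch_i(\mathcal{D}_2)} \\
&= \prod_{i\in \mathrm{fa}_{\mathcal{G}}(1)} \Sigma_{ii|\prec i\succ}^{\,ch_i(\mathcal{G})} \prod_{i\in V\setminus \mathrm{fa}_{\mathcal{G}}(1)} \Sigma_{ii|\prec i\succ}^{\,ch_i(\mathcal{G})} \\
&= \prod_{i\in V} \Sigma_{ii|\prec i\succ}^{\,ch_i(\mathcal{G})}.
\end{aligned}
\]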



b) It is clear that the mapping in Equation (11.8) is a diffeomorphism and the Jacobian of this mapping is 1. Thus the density of the induced distribution w.r.t. Lebesgue measure is also given by Equation (11.7). □

Remark 11.5. We note that for a homogeneous graph $\mathcal{G}$ the distribution with the associated density derived in Equation (11.7) coincides with the inverse Wishart distribution
(or covariance Wishart priors) introduced by Khare and Rajaratnam (5]|. 

11.4 Further properties of the DAG Wishart distributions $\pi^{S_{\mathcal{G}}}_{U,\alpha}$ and $\pi^{P_{\mathcal{G}}}_{U,\alpha}$

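We now proceed to derive useful properties of the DAG inverse Wishart distribution $\pi^{S_{\mathcal{G}}}_{U,\alpha}$. To this end, let us carefully lay out the setting. We begin with the Bayesian Gaussian model $\mathcal{N}(\mathcal{G})$ with parameter space $S_{\mathcal{G}}$. The elements of $\mathcal{N}(\mathcal{G})$ are of the form $N_m(0, \Sigma)$, such that $\Sigma^E \in S_{\mathcal{G}}$. Therefore, if $x \sim N_m(0, \Sigma)$, then for each $i \in V$ the distribution of $x_i$ given $x_{\prec i\succ}$ is parametrized by $(\Sigma_{ii|\prec i\succ}, \Sigma_{\prec i\succ}^{-1}\Sigma_{\prec i]})$. The following theorem formally establishes the strong directed hyper Markov property for this class for an arbitrary DAG.

Theorem 11.1. Let $\mathcal{G} = (V, E)$ be an arbitrary DAG. If $\Sigma^E \sim \pi^{S_{\mathcal{G}}}_{U,\alpha}$, then

i) $\{(\Sigma_{ii|\prec i\succ}, \Sigma_{\prec i\succ}^{-1}\Sigma_{\prec i]}) : i \in V\}$ are mutually independent and therefore $\pi^{S_{\mathcal{G}}}_{U,\alpha}$ is strongly directed hyper Markov.

ii) The distributions of $\Sigma_{ii|\prec i\succ}$ and $\Sigma_{\prec i\succ}^{-1}\Sigma_{\prec i]} \mid \Sigma_{ii|\prec i\succ}$ are, respectively, given by

\[
\Sigma_{ii|\prec i\succ} \sim IG\!\left(\frac{\alpha_i}{2} - \frac{pa_i}{2} - 1,\ \frac{1}{2} U_{ii|\prec i\succ}\right), \quad\text{and} \tag{11.11}
\]

\[
\Sigma_{\prec i\succ}^{-1}\Sigma_{\prec i]} \mid \Sigma_{ii|\prec i\succ} \sim N_{pa_i}\!\left(U_{\prec i\succ}^{-1}U_{\prec i]},\ \Sigma_{ii|\prec i\succ}\, U_{\prec i\succ}^{-1}\right). \tag{11.12}
\]

Proof. The proof is omitted as it is similar to that of Theorem 6.1. □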
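We now proceed to evaluate the expected value under $\pi^{S_{\mathcal{G}}}_{U,\alpha}$. First note that since $S_{\mathcal{G}}$ is an open subset of $\mathbb{R}^{|E|}$, the expected value of $\pi^{S_{\mathcal{G}}}_{U,\alpha}$ is well defined.

Proposition 11.4. Let $\mathcal{G}$ be an arbitrary DAG and $\Sigma^E \sim \pi^{S_{\mathcal{G}}}_{U,\alpha}$, with $\alpha_i > pa_i + 4$ for each $i$. Then the expected value of $\Sigma^E$ can be recursively computed by the following steps:

(i) $E[\Sigma_{mm}] = \dfrac{U_{mm}}{\alpha_m - 4}$,

(ii) $E[\Sigma_{\prec i]}] = E[\Sigma_{\prec i\succ}]\, U_{\prec i\succ}^{-1} U_{\prec i]}$,

(iii) $E[\Sigma_{ii}] = \dfrac{U_{ii|\prec i\succ}}{\alpha_i - pa_i - 4} + \mathrm{tr}\!\left( E[\Sigma_{\prec i\succ}] \left\{ \dfrac{U_{ii|\prec i\succ}\, U_{\prec i\succ}^{-1}}{\alpha_i - pa_i - 4} + U_{\prec i\succ}^{-1} U_{\prec i]} U_{[i\succ} U_{\prec i\succ}^{-1} \right\} \right)$, for $i = m-1, \ldots, 1$.

Proof. Since Equation (11.11) and Equation (11.12) are analogous versions of Equation (6.1) and Equation (6.2), but for general DAGs, the proof follows along the same lines as the proof of Proposition 8.2 and is therefore omitted. □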

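We now proceed to analyze the DAG Wishart distribution $\pi^{P_{\mathcal{G}}}_{U,\alpha}$ as a class of distributions in its own right. Once more let $\mathcal{G}$ be an arbitrary DAG and $\alpha$ a given vector in $\mathbb{R}^m$ such that $\alpha_i > pa_i + 2$ for all $i$. Now consider the family of DAG Wishart distributions $\{\pi^{P_{\mathcal{G}}}_{U,\alpha} : U \in \mathrm{PD}_{\mathcal{G}}\}$. Since $\mathrm{PD}_{\mathcal{G}}$ is isomorphic to $S_{\mathcal{G}}$ via the mapping $U \mapsto U^E$, it is more natural to parametrize this family of distributions as $\{\pi^{P_{\mathcal{G}}}_{U^E,\alpha} : U^E \in S_{\mathcal{G}}\}$. It is easy to check that this is an identifiable parametrization, i.e., if $\pi^{P_{\mathcal{G}}}_{U_1^E,\alpha}$ is a.s. equal to $\pi^{P_{\mathcal{G}}}_{U_2^E,\alpha}$, then $U_1^E = U_2^E$. The following lemma formalizes these points.

Lemma 11.2. Let $\mathcal{G}$ be a perfect DAG and let $\alpha$ be given. Then the Wishart family $\{\pi^{R_{\mathcal{G}}}_{U^E,\alpha} : U^E \in S_{\mathcal{G}}\}$, or equivalently $\{\pi^{P_{\mathcal{G}}}_{U^E,\alpha} : U^E \in S_{\mathcal{G}}\}$, is a general exponential family. If $\mathcal{G}$ is not a perfect DAG then $\{\pi^{P_{\mathcal{G}}}_{U^E,\alpha} : U^E \in S_{\mathcal{G}}\}$ is no longer a general exponential family but a curved exponential family.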

Proof. Let $t : R_{\mathcal{G}} \to Z_{\mathcal{G}}$ be the embedding $\Upsilon \mapsto [\Upsilon]^0$ and let $\eta : S_{\mathcal{G}} \to Z_{\mathcal{G}}$ be the embedding $U^E \mapsto [U^E]^0$. Then $\mathrm{tr}(\Upsilon U)$ is equal to the inner product of $[\Upsilon]^0$ and $[U^E]^0$ in the Euclidean space $Z_{\mathcal{G}}$. Note also that under these natural embeddings both $R_{\mathcal{G}}$ and $S_{\mathcal{G}}$ are open subsets of $Z_{\mathcal{G}}$. The result that $\{\pi^{P_{\mathcal{G}}}_{U^E,\alpha} : U^E \in S_{\mathcal{G}}\}$ is a general exponential family follows immediately from these observations.

Now if $\mathcal{G}$ is not perfect, the expression $\mathrm{tr}(\Upsilon U)$ depends not only on the entries in positions $ij$ where $i, j$ are adjacent in $\mathcal{G}$, but also on positions $ij$ where there exists an immorality $i \to k \leftarrow j$. Therefore, $\mathrm{tr}(\Upsilon U)$ is not equal to $\mathrm{tr}([\Upsilon]^0 [U^E]^0)$, the inner product of $[\Upsilon]^0$ and $[U^E]^0$ in $Z_{\mathcal{G}}$. It is however clear that $\mathrm{tr}(\Upsilon U)$ is the inner product of the projections of $\Upsilon$ and $U$ in the Euclidean space $Z_{\mathcal{G}^m}$, which has higher dimension than $|E|$. Hence when $\mathcal{G}$ is not perfect $\{\pi^{P_{\mathcal{G}}}_{U^E,\alpha} : U^E \in S_{\mathcal{G}}\}$ is no longer an exponential family, but only a curved exponential family. □

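Note that the proof of Lemma 11.2 shows that for an arbitrary non-perfect DAG $\mathcal{G}$, the family of DAG Wishart distributions $\{\pi^{P_{\mathcal{G}}}_{U^E,\alpha} : U^E \in S_{\mathcal{G}}\}$ is strictly a subfamily of $\{\pi^{P_{\mathcal{G}}}_{U,\alpha} : U \in \mathrm{PD}_m(\mathbb{R})\}$. On the other hand, if $\mathcal{G}$ is perfect, then $\{\pi^{P_{\mathcal{G}}}_{U,\alpha} : U \in \mathrm{PD}_m(\mathbb{R})\}$ is identical to $\{\pi^{P_{\mathcal{G}}}_{U^E,\alpha} : U^E \in S_{\mathcal{G}}\}$.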

12 Closing remarks 

This paper introduces a class of multi-parameter hyper Markov laws which generalize the 
classical Wishart distribution in a way that is useful for Bayesian inference for Gaussian 
directed acyclic graph (DAG) models. The paper then proceeds to develop a theoretical 
framework for Bayesian inference for DAG models in the Gaussian setting. The main 
breakthrough that has been achieved is that the framework applies to all DAG models and 
not just the narrower class of perfect DAGs. The perfect or decomposable assumption, 
a common feature in theoretical analysis of concentration and covariance graph models, 
tends to yield more abstract results, as compared to practical procedures. The development 
undertaken in this paper is free of such assumptions as it applies to all DAG models. This of 
course has tremendous benefits for applications in high dimensional settings. More specif- 
ically, the class of DAG Wishart distributions that are developed and investigated in this 
paper yields a rich and flexible class of conjugate Wishart distributions which generalize 
previous Wishart type distributions introduced in the literature. We proceed to demonstrate 
that normalizing constants, hyper-Markov properties, moments and Laplace transforms are 
available in closed form for our family of DAG Wisharts. Sampling from the distribu- 
tion also does not resort to expensive computational techniques - resulting in inferential 
procedures that are scalable to very high dimensional problems. 

Despite the advantages of this class of DAG Wishart distributions, we demonstrate that it is a challenge to evaluate their densities on covariance and concentration spaces, as these are curved manifolds. In particular, covariance and concentration spaces for non-perfect DAGs correspond to non-Euclidean spaces, on which densities w.r.t. standard Lebesgue measure are not defined. This paper develops two approaches to deriving priors on covariance and concentration spaces corresponding to arbitrary non-perfect DAGs. In the process, classes of DAG Wishart and DAG inverse Wishart distributions have been introduced and studied. Moreover, posterior moments are derived and shown to be available in closed form. The theory that is developed is also illustrated through examples to demonstrate that the methodology is readily applicable.
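Appendix/Supplemental Section

Proof of Theorem 5.1. Let us first simplify the expression by integrating out the terms involving the $D_{ii}$'s.

\[
\begin{aligned}
z_{\mathcal{G}}(U,\alpha) &= \int \exp\{-\tfrac{1}{2}\mathrm{tr}((LD^{-1}L')U)\} \prod_{i=1}^{m} D_{ii}^{-\alpha_i/2}\, dL\, dD \\
&= \int \exp\{-\tfrac{1}{2}\mathrm{tr}(D^{-1}(L'UL))\} \prod_{i=1}^{m} D_{ii}^{-\alpha_i/2}\, dL\, dD \\
&= \int \exp\{-\tfrac{1}{2}\sum_{i=1}^{m} D_{ii}^{-1}(L'UL)_{ii}\} \prod_{i=1}^{m} D_{ii}^{-\alpha_i/2}\, dD\, dL \\
&= \int \prod_{i=1}^{m} \left[ \int \exp\{-\tfrac{1}{2} D_{ii}^{-1}(L'UL)_{ii}\}\, D_{ii}^{-\alpha_i/2}\, dD_{ii} \right] dL \\
&= \int \prod_{i=1}^{m} \frac{\Gamma(\frac{\alpha_i}{2}-1)\, 2^{\frac{\alpha_i}{2}-1}}{((L'UL)_{ii})^{\frac{\alpha_i}{2}-1}}\, dL \qquad (\text{finite if and only if } \alpha_i > 2\ \forall\, i = 1, 2, \ldots, m) \\
&= \int \prod_{i=1}^{m} \frac{\Gamma(\frac{\alpha_i}{2}-1)\, 2^{\frac{\alpha_i}{2}-1}}{((L_{\cdot i})' U L_{\cdot i})^{\frac{\alpha_i}{2}-1}}\, dL \\
&= \prod_{i=1}^{m} \Gamma(\tfrac{\alpha_i}{2}-1)\, 2^{\frac{\alpha_i}{2}-1} \int \frac{dL_{\prec i]}}{\left( (1,\ L_{\prec i]}') \begin{pmatrix} U_{ii} & U_{[i\succ} \\ U_{\prec i]} & U_{\prec i\succ} \end{pmatrix} \begin{pmatrix} 1 \\ L_{\prec i]} \end{pmatrix} \right)^{\frac{\alpha_i}{2}-1}}. \qquad \text{(A)}
\end{aligned}
\]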

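We now show how in general one can evaluate an integral of the form

\[
\int_{\mathbb{R}^{d-1}} \frac{dx}{\left( (1,\ x') \begin{pmatrix} a & b' \\ b & A \end{pmatrix} \begin{pmatrix} 1 \\ x \end{pmatrix} \right)^{\gamma}},
\]

where the block partitioned matrix, formed by $a \in \mathbb{R}$, $b \in \mathbb{R}^{d-1}$ and the $(d-1) \times (d-1)$ matrix $A$, is positive definite. In order to simplify the above integral we proceed in two steps.

1) We first note that by the formula provided on [?, page 16],

\[
\int_{\mathbb{R}} \frac{dx}{(1+x^2)^{\gamma}} = \begin{cases} \dfrac{\sqrt{\pi}\, \Gamma(\gamma - \frac{1}{2})}{\Gamma(\gamma)} & \text{if } \gamma > \frac{1}{2}, \\ \infty & \text{otherwise.} \end{cases}
\]

By repeated application, we can generalize the above formula to

\[
\int_{\mathbb{R}^{d-1}} \frac{dx}{(x'x+1)^{\gamma}} = \begin{cases} \dfrac{(\sqrt{\pi})^{d-1}\, \Gamma(\gamma - \frac{d-1}{2})}{\Gamma(\gamma)} & \text{if } \gamma > \frac{d-1}{2}, \\ \infty & \text{otherwise.} \end{cases}
\]

2) Let us now consider the general integral displayed above. Making the linear transformation $y = A^{1/2}x + A^{-1/2}b$ it follows that for $\gamma > \frac{d-1}{2}$,

\[
\int_{\mathbb{R}^{d-1}} \frac{dx}{\left( (1,\ x') \begin{pmatrix} a & b' \\ b & A \end{pmatrix} \begin{pmatrix} 1 \\ x \end{pmatrix} \right)^{\gamma}} = \frac{1}{\det(A)^{1/2}} \int_{\mathbb{R}^{d-1}} \frac{dy}{(y'y + a - b'A^{-1}b)^{\gamma}} = \frac{(\sqrt{\pi})^{d-1}\, \Gamma(\gamma - \frac{d-1}{2})}{\Gamma(\gamma)\, \det(A)^{1/2}\, (a - b'A^{-1}b)^{\gamma - \frac{d-1}{2}}}. \tag{12.1}
\]

Applying the result from Equation (12.1) to the desired integral in Equation (A), with $\gamma = \frac{\alpha_i}{2} - 1$, $d - 1 = pa_i$, $a = U_{ii}$, $b = U_{\prec i]}$ and $A = U_{\prec i\succ}$, we obtain

\[
z_{\mathcal{G}}(U,\alpha) = \prod_{i=1}^{m} \frac{\Gamma(\frac{\alpha_i}{2} - \frac{pa_i}{2} - 1)\, 2^{\frac{\alpha_i}{2}-1}\, (\sqrt{\pi})^{pa_i}\, \det(U_{\prec i\succ})^{\frac{\alpha_i}{2} - \frac{pa_i}{2} - \frac{3}{2}}}{\det(U_{\preceq i\succeq})^{\frac{\alpha_i}{2} - \frac{pa_i}{2} - 1}},
\]

where $\det(U_{\prec i\succ}) := 1$ whenever $\mathrm{pa}(i) = \emptyset$. It is easily seen that $z_{\mathcal{G}}(U, \alpha)$ is finite if and only if $\alpha_i > pa_i + 2$ for each $i = 1, \ldots, m$. □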
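The closed form for the normalizing constant can be evaluated on the log scale. The sketch below is a hedged illustration, not code from the paper: the function name `log_normalizing_constant` and the `parents` list are hypothetical, vertices are 0-indexed, and `parents[i]` is assumed to list pa(i) with every parent carrying a larger index than its child.

```python
import math
import numpy as np

def log_normalizing_constant(U, alpha, parents):
    """log z_G(U, alpha) via the product formula; returns inf if some alpha_i <= pa_i + 2."""
    U = np.asarray(U, dtype=float)
    total = 0.0
    for i, pa in enumerate(parents):
        pa = list(pa)
        p = len(pa)
        shape = alpha[i] / 2.0 - p / 2.0 - 1.0   # argument of the Gamma function
        if shape <= 0:
            return math.inf                       # integral diverges for this vertex
        fa = [i] + pa                             # family of i: the vertex and its parents
        logdet_fa = np.linalg.slogdet(U[np.ix_(fa, fa)])[1]
        # det(U_<i>) := 1 when pa(i) is empty
        logdet_pa = np.linalg.slogdet(U[np.ix_(pa, pa)])[1] if p else 0.0
        total += (math.lgamma(shape)
                  + (alpha[i] / 2.0 - 1.0) * math.log(2.0)
                  + (p / 2.0) * math.log(math.pi)
                  + (shape - 0.5) * logdet_pa     # exponent alpha_i/2 - pa_i/2 - 3/2
                  - shape * logdet_fa)            # exponent alpha_i/2 - pa_i/2 - 1
    return total
```

Working with log-determinants and `lgamma` avoids overflow when the graph is large or the shape parameters are big.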

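Proof of Lemma 3.1. The likelihood of the data is given by

\[
f(y_1, y_2, \ldots, y_n \mid L, D) = \frac{1}{(\sqrt{2\pi})^{nm}} \exp\{-\tfrac{1}{2}\mathrm{tr}(LD^{-1}L'(nS))\}\, \det(D)^{-n/2}.
\]

When using $\pi^{\Theta_{\mathcal{G}}}_{U,\alpha}$ as the prior for $(D, L)$, the posterior distribution of $(D, L)$ given the data $(Y_1, Y_2, \ldots, Y_n)$ is given by

\[
\pi^{\Theta_{\mathcal{G}}}_{U,\alpha}(L, D \mid Y_1, Y_2, \ldots, Y_n) \propto \exp\{-\tfrac{1}{2}\mathrm{tr}(LD^{-1}L'(nS + U))\} \prod_{i=1}^{m} D_{ii}^{-\frac{n+\alpha_i}{2}}, \qquad (D, L) \in \Theta_{\mathcal{G}}. \tag{12.2}
\]

Hence the functional form of the posterior density is the same as that of the prior density, i.e.,

\[
\pi^{\Theta_{\mathcal{G}}}_{U,\alpha}(\cdot \mid Y_1, Y_2, \ldots, Y_n) = \pi^{\Theta_{\mathcal{G}}}_{\tilde{U},\tilde{\alpha}}(\cdot),
\]

where $\tilde{U} = nS + U$ and $\tilde{\alpha} = (\alpha_1 + n, \ldots, \alpha_m + n)$.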
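The conjugate update stated here (scale matrix $nS + U$, shape vector shifted by $n$) is a one-line computation. The following is a minimal sketch under the assumption, not spelled out in this section, that the data are zero-mean and that $S = \frac{1}{n}\sum_k y_k y_k'$; the function name `posterior_hyperparameters` is hypothetical.

```python
import numpy as np

def posterior_hyperparameters(U, alpha, Y):
    """Return (U + nS, alpha + n) for an n x m data matrix Y with rows y_k."""
    Y = np.asarray(Y, dtype=float)
    n = Y.shape[0]
    nS = Y.T @ Y                      # n * S, assuming S = (1/n) * sum_k y_k y_k'
    U_post = np.asarray(U, dtype=float) + nS
    alpha_post = np.asarray(alpha, dtype=float) + n
    return U_post, alpha_post
```

Because the update touches only the hyperparameters, posterior quantities follow by plugging the updated pair into the corresponding prior formulas.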

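Proof of Theorem 6.1. First consider the bijective mapping from the Cholesky parameterization to the D-parameterization:

\[
\phi := \left( (D, L) \mapsto \times_{i\in V}(D_{ii}, L_{\prec i]}) \right) : \Theta_{\mathcal{G}} \to \Xi_{\mathcal{G}}, \tag{12.3}
\]

with the inverse mapping $\left( \times_{i\in V}(\lambda_i, \beta_{\prec i]}) \mapsto (D, L) \right) : \Xi_{\mathcal{G}} \to \Theta_{\mathcal{G}}$, where $D = \mathrm{diag}(\lambda_1, \ldots, \lambda_m)$ and

\[
L_{ij} = \begin{cases} 1 & i = j, \\ \beta_{ij} & i \in \mathrm{pa}(j), \\ 0 & \text{otherwise.} \end{cases}
\]

Note that $\beta_{\prec j]} = (\beta_{ij} : i \in \mathrm{pa}(j))$ belongs to $\mathbb{R}^{\prec j]}$. Now $\pi^{\Theta_{\mathcal{G}}}_{U,\alpha}$ naturally induces a prior on $\Xi_{\mathcal{G}}$ which we shall denote by $\pi^{\Xi_{\mathcal{G}}}_{U,\alpha}$. As noted in Remark 4.2, $\phi$ in Equation (12.3) is simply a permutation of the entries of $D$ and $L$, hence its Jacobian is equal to 1. To derive the density of $\pi^{\Xi_{\mathcal{G}}}_{U,\alpha}$ it suffices to find an expression for $\mathrm{tr}((LD^{-1}L')U)$ in terms of $\times_{i\in V}(\lambda_i, \beta_{\prec i]})$. To this end, we proceed as follows.

\[
\begin{aligned}
\mathrm{tr}((LD^{-1}L')U) &= \mathrm{tr}(D^{-1}(L'UL)) = \sum_{i\in V} D_{ii}^{-1}(L'UL)_{ii} \\
&= \sum_{i\in V} D_{ii}^{-1} \Big( \sum_{k,l\in V} L_{ki} U_{kl} L_{li} \Big) \\
&= \sum_{i\in V} \lambda_i^{-1}\, (1,\ \beta_{\prec i]}') \begin{pmatrix} U_{ii} & U_{[i\succ} \\ U_{\prec i]} & U_{\prec i\succ} \end{pmatrix} \begin{pmatrix} 1 \\ \beta_{\prec i]} \end{pmatrix} \\
&= \sum_{i\in V} \lambda_i^{-1} \left[ (\beta_{\prec i]} + U_{\prec i\succ}^{-1}U_{\prec i]})'\, U_{\prec i\succ}\, (\beta_{\prec i]} + U_{\prec i\succ}^{-1}U_{\prec i]}) + U_{ii|\prec i\succ} \right].
\end{aligned}
\]

Therefore, the density of $\pi^{\Xi_{\mathcal{G}}}_{U,\alpha}$ w.r.t. the Lebesgue measure $\prod_{i\in V} d\lambda_i\, d\beta_{\prec i]}$ on $\times_{i\in V}(\mathbb{R}_+, \mathbb{R}^{\prec i]})$ is given by

\[
z_{\mathcal{G}}(U,\alpha)^{-1} \exp\Big\{ -\tfrac{1}{2} \sum_{i\in V} \Big( \lambda_i^{-1} (\beta_{\prec i]} + U_{\prec i\succ}^{-1}U_{\prec i]})'\, U_{\prec i\succ}\, (\beta_{\prec i]} + U_{\prec i\succ}^{-1}U_{\prec i]}) + \lambda_i^{-1} U_{ii|\prec i\succ} \Big) \Big\} \prod_{i\in V} \lambda_i^{-\alpha_i/2}. \tag{12.4}
\]

The above clearly shows that $\{(\lambda_i, \beta_{\prec i]}) : i = 1, \ldots, m\}$ are mutually independent. To complete the proof we first integrate out $\beta_{\prec i]}$ to obtain the marginal density of $\lambda_i$. Notice that the expression involving $\beta_{\prec i]}$ in Equation (12.4) is an unnormalized multivariate normal integral and thus integrating Equation (12.4) over the $\beta_{\prec i]}$ gives:

\[
\begin{aligned}
&\int_{\times_{i\in V} \mathbb{R}^{\prec i]}} \exp\Big\{ -\tfrac{1}{2} \sum_{i\in V} \Big( \lambda_i^{-1} (\beta_{\prec i]} + U_{\prec i\succ}^{-1}U_{\prec i]})'\, U_{\prec i\succ}\, (\beta_{\prec i]} + U_{\prec i\succ}^{-1}U_{\prec i]}) + \lambda_i^{-1} U_{ii|\prec i\succ} \Big) \Big\} \prod_{i\in V} \lambda_i^{-\alpha_i/2}\, d\beta_{\prec i]} \\
&\qquad \propto \prod_{i\in V} \exp\{ -\tfrac{1}{2} \lambda_i^{-1} U_{ii|\prec i\succ} \}\, \lambda_i^{-\frac{1}{2}\alpha_i + \frac{1}{2} pa_i}.
\end{aligned} \tag{12.5}
\]

The above shows that $\lambda_i \sim IG(\alpha_i/2 - pa_i/2 - 1,\ U_{ii|\prec i\succ}/2)$. It is evident from Equation (12.5) that $\beta_{\prec i]} \mid \lambda_i \sim N_{pa_i}(-U_{\prec i\succ}^{-1}U_{\prec i]},\ \lambda_i U_{\prec i\succ}^{-1})$. It is also immediately clear that the same result holds for elements of the Cholesky parameterization $(D_{ii}, L_{\prec i]})$, $i = 1, 2, \ldots, m$, as specified in the statement of the theorem.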
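Because the pairs $(D_{ii}, L_{\prec i]})$ are mutually independent with an inverse-gamma marginal and a conditional normal law, exact sampling needs no MCMC. The sketch below illustrates this under stated assumptions: the helper name `sample_dag_wishart` and the `parents` list are hypothetical, vertices are 0-indexed with parents at larger indices, and $IG(a, b)$ is read as the law of the reciprocal of a Gamma variable with shape $a$ and rate $b$.

```python
import numpy as np

def sample_dag_wishart(U, alpha, parents, rng):
    """Draw (D, L): D_ii ~ IG(alpha_i/2 - pa_i/2 - 1, U_{ii|<i>}/2), then L_{<i]} | D_ii normal."""
    U = np.asarray(U, dtype=float)
    m = U.shape[0]
    D = np.zeros(m)
    L = np.eye(m)                                  # unit diagonal, zeros for missing arrows
    for i, pa in enumerate(parents):
        pa = list(pa)
        p = len(pa)
        if p:
            U_pa_inv = np.linalg.inv(U[np.ix_(pa, pa)])
            U_pa_inv = (U_pa_inv + U_pa_inv.T) / 2.0    # guard against round-off asymmetry
            coef = U_pa_inv @ U[pa, i]                  # U_<i>^{-1} U_<i]
            schur = U[i, i] - U[i, pa] @ coef           # U_{ii|<i>}
        else:
            schur = U[i, i]
        shape = alpha[i] / 2.0 - p / 2.0 - 1.0
        # reciprocal of Gamma(shape, scale = 2 / schur) is IG(shape, schur / 2)
        D[i] = 1.0 / rng.gamma(shape, 2.0 / schur)
        if p:
            L[pa, i] = rng.multivariate_normal(-coef, D[i] * U_pa_inv)
    return D, L
```

A call such as `sample_dag_wishart(U, alpha, [[1, 2], [2], []], np.random.default_rng(0))` returns one draw for the complete DAG on three vertices; the concentration matrix is then `L @ np.diag(1 / D) @ L.T`.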

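Proof of Corollary 6.1. From Theorem 6.1, the marginal density of $L_{\prec i]}$ is

\[
\begin{aligned}
&\int \frac{1}{(2\pi)^{pa_i/2} \det(D_{ii} U_{\prec i\succ}^{-1})^{1/2}} \exp\{ -\tfrac{1}{2} (L_{\prec i]} + U_{\prec i\succ}^{-1}U_{\prec i]})' (D_{ii}^{-1} U_{\prec i\succ}) (L_{\prec i]} + U_{\prec i\succ}^{-1}U_{\prec i]}) \} \\
&\qquad \times \frac{(\tfrac{1}{2} U_{ii|\prec i\succ})^{\alpha_i/2 - pa_i/2 - 1}}{\Gamma(\alpha_i/2 - pa_i/2 - 1)}\, D_{ii}^{-\alpha_i/2 + pa_i/2} \exp\{ -\tfrac{1}{2} U_{ii|\prec i\succ} D_{ii}^{-1} \}\, dD_{ii} \\
&= \frac{(U_{ii|\prec i\succ})^{\alpha_i/2 - pa_i/2 - 1} \det(U_{\prec i\succ})^{1/2}}{2^{\alpha_i/2 - 1}\, \pi^{pa_i/2}\, \Gamma(\alpha_i/2 - pa_i/2 - 1)} \int \exp\{ -M_i D_{ii}^{-1} \}\, D_{ii}^{-\alpha_i/2}\, dD_{ii} \\
&= \frac{(U_{ii|\prec i\succ})^{\alpha_i/2 - pa_i/2 - 1} \det(U_{\prec i\succ})^{1/2}}{2^{\alpha_i/2 - 1}\, \pi^{pa_i/2}\, \Gamma(\alpha_i/2 - pa_i/2 - 1)} \times \frac{\Gamma(\alpha_i/2 - 1)}{M_i^{\alpha_i/2 - 1}},
\end{aligned}
\]

where $M_i = \tfrac{1}{2} U_{ii|\prec i\succ} + \tfrac{1}{2} (L_{\prec i]} + U_{\prec i\succ}^{-1}U_{\prec i]})'\, U_{\prec i\succ}\, (L_{\prec i]} + U_{\prec i\succ}^{-1}U_{\prec i]})$. Therefore the density of $L_{\prec i]}$ is given by

\[
c_i \left[ \tfrac{1}{2} U_{ii|\prec i\succ} + \tfrac{1}{2} (L_{\prec i]} + U_{\prec i\succ}^{-1}U_{\prec i]})'\, U_{\prec i\succ}\, (L_{\prec i]} + U_{\prec i\succ}^{-1}U_{\prec i]}) \right]^{-\alpha_i/2 + 1}, \tag{12.6}
\]

where $c_i$ collects the factors that do not depend on $L_{\prec i]}$.

By Theorem 6.1 the $L_{\prec i]}$ are mutually independent, hence the form of the density in the statement of the corollary is immediate from the above calculations. The parameters corresponding to the t-distribution follow by comparing the density in Equation (12.6) to the functional form of the density of the multivariate t-distribution.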

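Proof of Corollary 7.1.

By definition, the Laplace transform of $(\lambda, x)$ at $(\xi, u) \in \mathbb{R} \times \mathbb{R}^p$ is

\[
\begin{aligned}
&\int \exp\{-(\xi\lambda + u'x)\}\, dN_p(\mu, \lambda\Psi)(x)\, dIG(\nu, \eta)(\lambda) \\
&= \int \exp\{-\xi\lambda\} \left( \int \exp\{-u'x\}\, dN_p(\mu, \lambda\Psi)(x) \right) dIG(\nu, \eta)(\lambda) \\
&= \int \exp\{-\xi\lambda\} \exp\{-u'\mu + \tfrac{1}{2}\lambda u'\Psi u\}\, dIG(\nu, \eta)(\lambda) \\
&= \int \exp\{-\xi\lambda\} \exp\{-u'\mu + \tfrac{1}{2}\lambda u'\Psi u\} \left( \frac{\eta^{\nu}}{\Gamma(\nu)} \exp\{-\eta\lambda^{-1}\}\, \lambda^{-\nu-1} \right) d\lambda \\
&= \frac{\eta^{\nu}}{\Gamma(\nu)} \exp\{-u'\mu\} \int \exp\{ -(\xi - \tfrac{1}{2} u'\Psi u)\lambda - \eta\lambda^{-1} \}\, \lambda^{-\nu-1}\, d\lambda \\
&= \frac{2\eta^{\nu}}{\Gamma(\nu)} \exp\{-u'\mu\} \left( \frac{\xi - \tfrac{1}{2} u'\Psi u}{\eta} \right)^{\nu/2} K_{\nu}\!\left( 2\sqrt{\eta(\xi - \tfrac{1}{2} u'\Psi u)} \right) \\
&= \frac{2\eta^{\nu/2}}{\Gamma(\nu)} \exp\{-u'\mu\} \left( \xi - \tfrac{1}{2} u'\Psi u \right)^{\nu/2} K_{\nu}\!\left( 2\sqrt{\eta(\xi - \tfrac{1}{2} u'\Psi u)} \right),
\end{aligned}
\]

where $K_{\nu}$ denotes the modified Bessel function of the second kind.

Note that in computing the integral above we have used the fact that the Laplace transform of $N_p(\mu, \lambda\Psi)$ at $u$ is equal to $\exp\{-u'\mu + \tfrac{1}{2}\lambda u'\Psi u\}$. For computing the integral w.r.t. $d\lambda$ we use Equation (9.42) in [23, page 235].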

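Proof of Lemma 7.1.

By definition, the Laplace transform of $\pi^{\Theta_{\mathcal{G}}}_{U,\alpha}$ at $(\Lambda, Z) \in \Theta_{\mathcal{G}}$ is given by

\[
\mathcal{L}_{\Theta_{\mathcal{G}}}(\Lambda, Z) := \int \exp\{ -\mathrm{tr}(\Lambda D') - \mathrm{tr}(Z L') \}\, \pi^{\Theta_{\mathcal{G}}}_{U,\alpha}(D, L)\, dD\, dL.
\]

Now under the change of variable $\phi : \Theta_{\mathcal{G}} \to \Xi_{\mathcal{G}}$ defined in Equation (12.3) and the fact that

\[
\mathrm{tr}(\Lambda D') + \mathrm{tr}(Z L') = \sum_{i=1}^{m} D_{ii}\Lambda_{ii} + \sum_{i=1}^{m} \left( 1 + L_{\prec i]}' Z_{\prec i]} \right)
\]

we have

\[
\begin{aligned}
\mathcal{L}_{\Theta_{\mathcal{G}}}(\Lambda, Z) &= \int \exp\Big\{ -\sum_{i=1}^{m} D_{ii}\Lambda_{ii} - \sum_{i=1}^{m} \left( 1 + L_{\prec i]}' Z_{\prec i]} \right) \Big\}\, \pi^{\Xi_{\mathcal{G}}}_{U,\alpha}\big( \times_{i=1}^{m} (D_{ii}, L_{\prec i]}) \big) \prod_{i=1}^{m} dD_{ii}\, dL_{\prec i]} \\
&= e^{-m}\, \mathcal{L}_{\Xi_{\mathcal{G}}}\big( \times_{i=1}^{m} (\Lambda_{ii}, Z_{\prec i]}) \big).
\end{aligned}
\]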
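Proof of Lemma 8.1.

Let $\Omega \in P_{\mathcal{G}}$, and $(D, L) \in \Theta_{\mathcal{G}}$ such that $\Omega = LD^{-1}L'$. Note that for each $1 \le j \le i \le m$,

\[
\Omega_{ij} = (LD^{-1}L')_{ij} = \sum_{k=1}^{m} L_{ik} L_{jk} D_{kk}^{-1} = \sum_{k=1}^{j} L_{ik} L_{jk} D_{kk}^{-1}, \tag{12.7}
\]

since $L$ is lower triangular. Now from Equation (12.7) it follows, by noting that $L_{jj} = 1$ for all $j$, that

\[
\frac{\partial}{\partial L_{ij}} (LD^{-1}L')_{ij} = D_{jj}^{-1}, \quad (i, j) \in E, \qquad \frac{\partial}{\partial D_{ii}} (LD^{-1}L')_{ii} = -D_{ii}^{-2}, \quad i = 1, 2, \ldots, m.
\]

Arrange the entries of $\theta = (D, L) \in \Theta_{\mathcal{G}}$ as $D_{11}$, $\{L_{2k} : (2, k) \in E, 1 \le k < 2\}$, $D_{22}$, $\{L_{3k} : (3, k) \in E, 1 \le k < 3\}$, $\ldots$, $D_{m-1,m-1}$, $\{L_{mk} : (m, k) \in E, 1 \le k < m\}$, $D_{mm}$, and the entries of $\Omega \in P_{\mathcal{G}}$ as $\Omega_{11}$, $\{\Omega_{2k} : (2, k) \in E, 1 \le k < 2\}$, $\Omega_{22}$, $\{\Omega_{3k} : (3, k) \in E, 1 \le k < 3\}$, $\ldots$, $\Omega_{m-1,m-1}$, $\{\Omega_{mk} : (m, k) \in E, 1 \le k < m\}$, $\Omega_{mm}$. From (12.7) it is easily seen that $\Omega_{ij}$ depends on $\{L_{jk} : (j, k) \in E, 1 \le k < j\}$, $\{L_{ik} : (i, k) \in E, 1 \le k \le j\}$ and $\{D_{kk}, 1 \le k \le j\}$, hence it is clear that $\Omega_{ij}$ is functionally independent of the elements of $\Theta_{\mathcal{G}}$ that follow it in the arrangement described above. Hence the gradient matrix of $\psi$ (with this arrangement) is a lower triangular matrix, and the Jacobian of $\psi$ is therefore given as

\[
\prod_{i=1}^{m} \Big( \prod_{j \in \mathrm{ch}(i)} D_{jj}^{-1} \Big) \prod_{i=1}^{m} D_{ii}^{-2}.
\]

It follows from the expression above that the Jacobian of $\psi$ is

\[
\prod_{j=1}^{m} D_{jj}^{-(pa_j + 2)}.
\]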
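To make the triangular structure of the gradient matrix concrete, here is a small worked case. It is an illustration under assumed labels, not an example from the text: the two-vertex DAG $2 \to 1$, so that $pa_1 = 1$, $pa_2 = 0$ and the only free entry of $L$ is $L_{21}$, with the Jacobian $\prod_j D_{jj}^{-(pa_j+2)}$ read up to sign.

```latex
\begin{aligned}
&L = \begin{pmatrix} 1 & 0 \\ L_{21} & 1 \end{pmatrix},\quad
 D = \operatorname{diag}(D_{11}, D_{22}),\quad
 \Omega = L D^{-1} L' :\qquad
 \Omega_{11} = D_{11}^{-1},\quad
 \Omega_{21} = L_{21} D_{11}^{-1},\quad
 \Omega_{22} = L_{21}^{2} D_{11}^{-1} + D_{22}^{-1}. \\[4pt]
&\frac{\partial(\Omega_{11}, \Omega_{21}, \Omega_{22})}{\partial(D_{11}, L_{21}, D_{22})}
 = \begin{pmatrix}
   -D_{11}^{-2} & 0 & 0 \\
   -L_{21} D_{11}^{-2} & D_{11}^{-1} & 0 \\
   -L_{21}^{2} D_{11}^{-2} & 2 L_{21} D_{11}^{-1} & -D_{22}^{-2}
   \end{pmatrix}, \\[4pt]
&\left| \det \right| = D_{11}^{-2} \cdot D_{11}^{-1} \cdot D_{22}^{-2}
 = D_{11}^{-(pa_1 + 2)}\, D_{22}^{-(pa_2 + 2)} .
\end{aligned}
```

The matrix is lower triangular in the stated arrangement, and the diagonal contributes one factor $D_{11}^{-1}$ for the single arrow into vertex 1 plus a factor $D_{ii}^{-2}$ per vertex.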
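Proof of Proposition 8.1.

Write $L_{\cdot i}$ for the $i$-th column of $L$, and let $(\cdot)^0$ denote the $m \times m$ matrix obtained by padding the block indexed by $\mathrm{fa}(i)$ with zeros. Then

\[
\begin{aligned}
E[\Omega \mid D] = E[LD^{-1}L' \mid D] &= E\Big[ \sum_{i=1}^{m} D_{ii}^{-1} L_{\cdot i} L_{\cdot i}' \,\Big|\, D \Big] \\
&= \sum_{i=1}^{m} D_{ii}^{-1}\, E\left[ \left( \begin{pmatrix} 1 \\ \beta_{\prec i]} \end{pmatrix} (1,\ \beta_{\prec i]}') \right)^{0} \,\Big|\, D_{ii} \right] \\
&= \sum_{i=1}^{m} D_{ii}^{-1} \begin{pmatrix} 1 & -U_{[i\succ} U_{\prec i\succ}^{-1} \\ -U_{\prec i\succ}^{-1} U_{\prec i]} & E[\beta_{\prec i]}\beta_{\prec i]}' \mid D] \end{pmatrix}^{0}.
\end{aligned} \tag{12.8}
\]

The conditional expectation in Equation (12.8) can be obtained by computing the following:

\[
\begin{aligned}
E[\beta_{\prec i]}\beta_{\prec i]}' \mid D] &= \mathrm{Var}[\beta_{\prec i]} \mid D] + E[\beta_{\prec i]} \mid D]\, E[\beta_{\prec i]} \mid D]' \\
&= D_{ii} U_{\prec i\succ}^{-1} + U_{\prec i\succ}^{-1} U_{\prec i]} U_{[i\succ} U_{\prec i\succ}^{-1}.
\end{aligned} \tag{12.9}
\]

Now since $D_{ii}^{-1} \sim G(\alpha_i/2 - pa_i/2 - 1,\ 2U_{ii|\prec i\succ}^{-1})$, $E[D_{ii}^{-1}] = (\alpha_i - pa_i - 2)\, U_{ii|\prec i\succ}^{-1}$ and therefore

\[
\begin{aligned}
E[\Omega] &= \sum_{i=1}^{m} \begin{pmatrix} (\alpha_i - pa_i - 2)\, U_{ii|\prec i\succ}^{-1} & -(\alpha_i - pa_i - 2)\, U_{ii|\prec i\succ}^{-1} U_{[i\succ} U_{\prec i\succ}^{-1} \\ -(\alpha_i - pa_i - 2)\, U_{\prec i\succ}^{-1} U_{\prec i]} U_{ii|\prec i\succ}^{-1} & U_{\prec i\succ}^{-1} + (\alpha_i - pa_i - 2)\, U_{\prec i\succ}^{-1} U_{\prec i]} U_{ii|\prec i\succ}^{-1} U_{[i\succ} U_{\prec i\succ}^{-1} \end{pmatrix}^{0} \\
&= \sum_{i=1}^{m} (\alpha_i - pa_i - 2) \left( U_{\preceq i\succeq}^{-1} \right)^{0} - \sum_{i=1}^{m} (\alpha_i - pa_i - 3) \left( U_{\prec i\succ}^{-1} \right)^{0}.
\end{aligned}
\]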
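The closing expression for $E[\Omega]$ is a sum of zero-padded inverse blocks, which is straightforward to assemble. The following sketch is illustrative only: the name `expected_concentration` and the `parents` list are hypothetical, vertices are 0-indexed, and the weights $(\alpha_i - pa_i - 2)$ on the family block and $(\alpha_i - pa_i - 3)$ on the parent block are taken from the last display above.

```python
import numpy as np

def expected_concentration(U, alpha, parents):
    """E[Omega] as a sum of padded inverses of family and parent blocks of U."""
    U = np.asarray(U, dtype=float)
    m = U.shape[0]
    EO = np.zeros((m, m))
    for i, pa in enumerate(parents):
        pa = list(pa)
        p = len(pa)
        fa = [i] + pa
        # (alpha_i - pa_i - 2) * inverse of the family block, padded with zeros
        EO[np.ix_(fa, fa)] += (alpha[i] - p - 2.0) * np.linalg.inv(U[np.ix_(fa, fa)])
        if p:
            # minus (alpha_i - pa_i - 3) * inverse of the parent block, padded with zeros
            EO[np.ix_(pa, pa)] -= (alpha[i] - p - 3.0) * np.linalg.inv(U[np.ix_(pa, pa)])
    return EO
```

For a single vertex this reduces to $(\alpha_1 - 2)/U_{11}$, the mean of the Gamma law of $D_{11}^{-1}$.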

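Proof of Proposition 8.2.

First recall from Equation (4.5) that for each $i \in V$

(i) $\Sigma_{ii} = \lambda_i + \beta_{\prec i]}' \Sigma_{\prec i\succ} \beta_{\prec i]} = \lambda_i + \mathrm{tr}(\Sigma_{\prec i\succ} \beta_{\prec i]} \beta_{\prec i]}')$ and
(ii) $\Sigma_{\prec i]} = -\Sigma_{\prec i\succ} \beta_{\prec i]}$.

Starting from the largest index $m \in V$ we have $\mathrm{pa}(m) = \emptyset$ and $\Sigma_{mm} = \lambda_m$. Therefore $E[\Sigma_{mm}] = \frac{U_{mm}}{\alpha_m - 4}$, as $\lambda_m \sim IG(\frac{\alpha_m}{2} - 1,\ \frac{1}{2} U_{mm})$. Now suppose that for some $1 \le i < m$ the expected values of $\Sigma_{kl}$ for all $k, l \in \mathrm{pa}(i)$ have been calculated, i.e., $E[\Sigma_{\prec i\succ}]$ is known. Using part (ii) above and the fact that $\Sigma_{\prec i\succ} \perp \beta_{\prec i]}$, due to the mutual independence property of $\{(D_{ii}, L_{\prec i]}) : i = 1, \ldots, m\}$ as given by Theorem 6.1, we obtain

\[
E[\Sigma_{\prec i]}] = -E[\Sigma_{\prec i\succ}]\, E[\beta_{\prec i]}] = E[\Sigma_{\prec i\succ}]\, U_{\prec i\succ}^{-1} U_{\prec i]}.
\]

Now applying part (i), Equation (6.1) from Theorem 6.1 and Equation (12.9) we obtain

\[
\begin{aligned}
E[\Sigma_{ii}] &= E[\lambda_i] + \mathrm{tr}\big( E[\Sigma_{\prec i\succ} \beta_{\prec i]} \beta_{\prec i]}'] \big) \\
&= \frac{U_{ii|\prec i\succ}}{\alpha_i - pa_i - 4} + \mathrm{tr}\big( E[\Sigma_{\prec i\succ}]\, E[E[\beta_{\prec i]} \beta_{\prec i]}' \mid \lambda_i]] \big) \\
&= \frac{U_{ii|\prec i\succ}}{\alpha_i - pa_i - 4} + \mathrm{tr}\left( E[\Sigma_{\prec i\succ}] \left\{ \frac{U_{ii|\prec i\succ}\, U_{\prec i\succ}^{-1}}{\alpha_i - pa_i - 4} + U_{\prec i\succ}^{-1} U_{\prec i]} U_{[i\succ} U_{\prec i\succ}^{-1} \right\} \right).
\end{aligned}
\]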
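The recursion in this proof runs from the largest index down to 1 and can be sketched in a few lines. This is a hedged illustration: the name `expected_covariance` and the `parents` list are hypothetical, vertices are 0-indexed with parents at larger indices, and the sketch assumes every parent set is pairwise adjacent (as in a perfect DAG) so that $E[\Sigma_{\prec i\succ}]$ is already available when vertex $i$ is visited. The cross term uses $E[\beta_{\prec i]}] = -U_{\prec i\succ}^{-1}U_{\prec i]}$ from Theorem 6.1.

```python
import numpy as np

def expected_covariance(U, alpha, parents):
    """Recursive E[Sigma]; entries never reached by the recursion are left as NaN."""
    U = np.asarray(U, dtype=float)
    m = U.shape[0]
    ES = np.full((m, m), np.nan)
    for i in range(m - 1, -1, -1):                 # largest index first
        pa = list(parents[i])
        p = len(pa)
        denom = alpha[i] - p - 4.0
        if denom <= 0:
            raise ValueError("need alpha_i > pa_i + 4 for a finite mean")
        if p == 0:
            ES[i, i] = U[i, i] / denom             # E[lambda_i] with no parents
            continue
        ES_pa = ES[np.ix_(pa, pa)]
        if np.isnan(ES_pa).any():
            raise ValueError("parents of vertex %d are not pairwise adjacent" % i)
        U_pa_inv = np.linalg.inv(U[np.ix_(pa, pa)])
        b = U_pa_inv @ U[pa, i]                    # U_<i>^{-1} U_<i]
        schur = U[i, i] - U[i, pa] @ b             # U_{ii|<i>}
        cross = ES_pa @ b                          # E[Sigma_<i]]
        ES[pa, i] = cross
        ES[i, pa] = cross
        second_moment = (schur / denom) * U_pa_inv + np.outer(b, b)   # E[beta beta']
        ES[i, i] = schur / denom + np.trace(ES_pa @ second_moment)
    return ES
```

Combined with the conjugate update, calling this on the posterior hyperparameters gives the posterior mean of the covariance entries on the graph.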

13 Hausdorff measures

In order to derive the density of $\pi^{P_{\mathcal{G}}}_{U,\alpha}$ we begin with a short introduction to Hausdorff measures. The reader is referred to [2, Section 19] for more details on this topic.

Let $\mathcal{X} = (X, d)$ be a metric space, $\delta$ a positive real number and $V$ a subset of $X$. A $\delta$-cover of $V$ is a finite or infinite sequence $\{U_i, i \in I\}$ of subsets of $X$ such that

\[
V \subset \cup\{U_i : i \in I\} \quad\text{and}\quad d(U_i) = \sup\{d(x, y) : x, y \in U_i\} < \delta, \quad \forall i \in I.
\]

Given $r > 0$ we define a set function

\[
\mathcal{H}^{r}_{\delta,\mathcal{X}}(V) := \inf\Big\{ \sum_{i\in I} d(U_i)^r : \{U_i : i \in I\} \text{ is a } \delta\text{-cover of } V \Big\}. \tag{13.1}
\]

Note that the infimum is taken over all $\delta$-covers of $V$. If no such cover exists, then the infimum is $+\infty$. The $r$-dimensional Hausdorff (outer) measure $\mathcal{H}^{r}_{\mathcal{X}}$ is now defined as follows:

\[
\mathcal{H}^{r}_{\mathcal{X}}(V) := c_r \lim_{\delta \to 0} \mathcal{H}^{r}_{\delta,\mathcal{X}}(V),
\]

where $c_r$ is an optional normalizing constant. From Equation (13.1) it is clear that when $V$ is a subset of $X_0 \subset X$ it is enough to include $\delta$-covers consisting of the subsets of $X_0$ alone. In this framework, when $\mathcal{X}$ is the Euclidean space $(\mathbb{R}^n, \|\cdot\|)$, without raising any ambiguity, we suppress the subscript $\mathcal{X}$ and write $\mathcal{H}^r$ for the $r$-dimensional Hausdorff measure on $\mathbb{R}^n$. Furthermore, in this case we choose the normalizing constant $c_r$ to be the volume of the $r$-dimensional ball of diameter 1 in $\mathbb{R}^r$. By incorporating this normalizing constant, $\mathcal{H}^n$ coincides with the standard Lebesgue measure on $\mathbb{R}^n$.



13.1 Integration and change of variable

We now proceed to discuss integration and change of variable in the context of Hausdorff measures. For a $k \times n$ matrix $A \in \mathbb{R}^{k\times n}$ let us define $|A| = \sqrt{\det(A'A)}$. More generally, if $T : \mathbb{R}^n \to \mathbb{R}^k$ is a linear mapping, then we define $|T|$ in terms of, but clearly independent of, a matrix representation of $T$. If $T$ is one-to-one, then $\mathcal{H}^n(T(V)) = |T|\, \mathcal{H}^n(V)$, for each $V \subset \mathbb{R}^n$.

Now suppose $f$ is a continuous, one-to-one mapping from an open subset $V$ of $\mathbb{R}^n$ into $\mathbb{R}^k$. If $f$ has continuous partial derivatives, then by Theorem 19.3 in [2] we have:

\[
\int_{V} g(f(x))\, J(f(x))\, \lambda_n(dx) = \int_{f(V)} g(y)\, \mathcal{H}^n(dy), \tag{13.2}
\]

where $J(f(x))$ is the (Hausdorff) Jacobian of $f$ defined by $|Df(x)|$ and $\lambda_n$ is the standard Lebesgue measure on $\mathbb{R}^n$.
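The Hausdorff Jacobian $|Df(x)| = \sqrt{\det(Df(x)'Df(x))}$ is easy to compute once the $k \times n$ derivative matrix is available. The sketch below is an illustration with a hypothetical helper name, `hausdorff_jacobian`; the helix and the plane used in the usage note are assumed examples and do not come from the text.

```python
import numpy as np

def hausdorff_jacobian(Df):
    """Return sqrt(det(Df' Df)) for a k x n derivative matrix Df with k >= n."""
    Df = np.atleast_2d(np.asarray(Df, dtype=float))
    gram = Df.T @ Df                   # n x n Gram matrix of the columns of Df
    return float(np.sqrt(np.linalg.det(gram)))

def helix_derivative(t):
    """Derivative of f(t) = (cos t, sin t, t), a map from R into R^3."""
    return np.array([[-np.sin(t)], [np.cos(t)], [1.0]])
```

For the helix, `hausdorff_jacobian(helix_derivative(t))` is constant in `t`, so by the change-of-variable formula the one-dimensional Hausdorff measure (arc length) of one turn is that constant times $2\pi$. For the plane $f(x, y) = (x, y, x + y)$ the derivative matrix is `[[1, 0], [0, 1], [1, 1]]`, and the same helper gives the area scaling factor.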